Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(97)

Issue 9965010: Regexp: Improve the speed that we scan for an initial point where a non-anchored (Closed)

Created:
8 years, 8 months ago by Erik Corry
Modified:
8 years, 8 months ago
Reviewers:
ulan
CC:
v8-dev
Visibility:
Public.

Description

Regexp: Improve the speed that we scan for an initial point where a non-anchored regexp can match by using a Boyer-Moore-like table. This is done by identifying non-greedy non-capturing loops in the nodes that eat any character one at a time. For example in the middle of the regexp /foo[\s\S]*?bar/ we find such a loop. There is also such a loop implicitly inserted at the start of any non-anchored regexp. When we have found such a loop we look ahead in the nodes to find the set of characters that can come at given distances. For example for the regexp /.?foo/ we know that there are at least 3 characters ahead of us, and the sets of characters that can occur are [any, [f, o], [o]]. We find a range in the lookahead info where the set of characters is reasonably constrained. In our example this is from index 1 to 2 (0 is not constrained). We can now look 3 characters ahead and if we don't find one of [f, o] (the union of [f, o] and [o]) then we can skip forwards by the range size (in this case 2). For Unicode input strings we do the same, but modulo 128. We also look at the first string fed to the regexp and use that to get a hint of the character frequencies in the inputs. This affects the assessment of whether the set of characters is 'reasonably constrained'. We still have the old lookahead mechanism, which uses a wide load of multiple characters followed by a mask and compare to determine whether a match is possible at this point. Committed: http://code.google.com/p/v8/source/detail?r=11204

Patch Set 1 #

Patch Set 2 : '' #

Patch Set 3 : '' #

Total comments: 58

Patch Set 4 : '' #

Patch Set 5 : '' #

Total comments: 6
Unified diffs Side-by-side diffs Delta from patch set Stats (+611 lines, -37 lines) Patch
M src/arm/regexp-macro-assembler-arm.cc View 1 2 3 2 chunks +12 lines, -4 lines 0 comments Download
M src/ia32/regexp-macro-assembler-ia32.cc View 1 2 3 3 chunks +6 lines, -6 lines 0 comments Download
M src/jsregexp.h View 1 2 3 14 chunks +137 lines, -3 lines 0 comments Download
M src/jsregexp.cc View 1 2 3 4 19 chunks +438 lines, -17 lines 6 comments Download
M src/x64/regexp-macro-assembler-x64.cc View 1 2 3 2 chunks +14 lines, -6 lines 0 comments Download
M test/cctest/test-regexp.cc View 1 chunk +4 lines, -1 line 0 comments Download

Messages

Total messages: 7 (0 generated)
Erik Corry
8 years, 8 months ago (2012-03-30 09:47:57 UTC) #1
Erik Corry
8 years, 8 months ago (2012-03-30 10:00:27 UTC) #2
ulan
First round of comments, mostly nits that you can fix while I am reviewing GetSkipTable ...
8 years, 8 months ago (2012-03-30 13:04:48 UTC) #3
ulan
Second round of comments. http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc File src/jsregexp.cc (right): http://codereview.chromium.org/9965010/diff/6/src/jsregexp.cc#newcode3349 src/jsregexp.cc:3349: int skip = max_lookahead - ...
8 years, 8 months ago (2012-03-30 16:54:51 UTC) #4
Erik Corry
New version uploaded. More comments and clearer heuristics. Performance is a little better, partly due ...
8 years, 8 months ago (2012-04-01 00:48:46 UTC) #5
ulan
LGTM http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc File src/jsregexp.cc (right): http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3316 src/jsregexp.cc:3316: int probability = (in_quickcheck_range ? kSize / 2 ...
8 years, 8 months ago (2012-04-02 08:12:24 UTC) #6
Erik Corry
8 years, 8 months ago (2012-04-02 09:36:22 UTC) #7
http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc
File src/jsregexp.cc (right):

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3316
src/jsregexp.cc:3316: int probability = (in_quickcheck_range ? kSize / 2 :
kSize) - frequency;
On 2012/04/02 08:12:24, ulan wrote:
> This shouldn't affect the correctness but:
> 
> if I read the code correctly, the probability can be negative here, because
> frequency can be > kSize. I think the upper bound on frequency is 2*kSize.
> 
> Also, did you mean (kSize - frequency) / 2 in the quickcheck case?

This code is as intended.  I added a few comments.

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode3329
src/jsregexp.cc:3329: // occur occur in the subject string in the range between
min_lookahead and
On 2012/04/02 08:12:24, ulan wrote:
> 'occur' twice.

Done.

http://codereview.chromium.org/9965010/diff/1009/src/jsregexp.cc#newcode6018
src/jsregexp.cc:6018: if (chars_sampled > kSampleSize) break;
On 2012/04/02 08:12:24, ulan wrote:
> Now we can move this into the for loop condition.
> i < sample_subject->length() && chars_sampled < kSampleSize

Done.

Powered by Google App Engine
This is Rietveld 408576698