|
|
Created:
11 years, 1 month ago by James Hawkins Modified:
9 years, 7 months ago CC:
chromium-reviews_googlegroups.com, ben+cc_chromium.org Base URL:
svn://chrome-svn/chrome/trunk/src/ Visibility:
Public. |
DescriptionAdd Pattern, an implementation of a regular expression parser and matcher.
BUG=none
TEST=none
Patch Set 1 #
Total comments: 4
Patch Set 2 : '' #
Total comments: 10
Patch Set 3 : '' #
Total comments: 1
Messages
Total messages: 26 (0 generated)
I'm probably not the best person to review this, since I know nothing about regexps. Don't we already have a regexp parser somewhere? -Ben On Mon, Nov 23, 2009 at 1:48 PM, <jhawkins@chromium.org> wrote: > Reviewers: Ben Goodger, > > Description: > Add Pattern, an implementation of a regular expression parser and matcher. > > BUG=none > TEST=none > > Please review this at http://codereview.chromium.org/439008 > > SVN Base: svn://chrome-svn/chrome/trunk/src/ > > Affected files: > A chrome/browser/autofill/pattern.h > M chrome/chrome.gyp > > >
What's this needed for? http://codereview.chromium.org/439008/diff/1/2 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/1/2#newcode22 chrome/browser/autofill/pattern.h:22: // a|b : match either a or b Can you add a real BNF here to make clear how these can be combined?
Also, v8 has a full regex engine (irregexp) and extensions has a simple URL pattern matcher (for user scripts). Do we really need a third one?
On 2009/11/23 23:43:38, tony wrote: > Also, v8 has a full regex engine (irregexp) and extensions has a simple URL > pattern matcher (for user scripts). Do we really need a third one? Correct me if I'm wrong, but from my earlier investigations, the v8 regex engine only works for the interpreter. The extension pattern matcher is for URLs only, which is not what is needed for autofill.
On 2009/11/23 23:57:49, James Hawkins wrote: > Correct me if I'm wrong, but from my earlier investigations, the v8 regex engine > only works for the interpreter. The extension pattern matcher is for URLs only, > which is not what is needed for autofill. Oh, I also forgot jsre used by WebCore. Can you answer Nico's question as to what his is needed for (i.e., provide some use cases)? Maybe there's another way to accomplish the same goal. Also, just quickly glancing at the code, it looks like a bad pattern can easily smash the stack by recursing too much. Is it possible to put something like this in the renderer process? It's often easy to make patterns that take an infinite time to execute, but I'm not sure if that's the case with your grammar. Where do the patterns and pattern inputs come from?
On 2009/11/24 00:10:29, tony wrote: > On 2009/11/23 23:57:49, James Hawkins wrote: > > Correct me if I'm wrong, but from my earlier investigations, the v8 regex > engine > > only works for the interpreter. The extension pattern matcher is for URLs > only, > > which is not what is needed for autofill. > > Oh, I also forgot jsre used by WebCore. > > Can you answer Nico's question as to what his is needed for (i.e., provide some > use cases)? Maybe there's another way to accomplish the same goal. > This is used by the autofill heuristic code. An example pattern to match the last name field in a form: "last name|lname|surname" The patterns are created by the user of the API (autofill), and the matcher is matching the patterns against form field names and labels from any web page. > Also, just quickly glancing at the code, it looks like a bad pattern can easily > smash the stack by recursing too much. Is it possible to put something like > this in the renderer process? > Are you asking if we can move the regexp matcher to the renderer process as a form of sandboxing? > It's often easy to make patterns that take an infinite time to execute, but I'm > not sure if that's the case with your grammar. Where do the patterns and > pattern inputs come from? See above. I'm in the process of writing up the grammar in BNF. I do need to write unit tests for Pattern, but right now I'm just trying to get all the autofill heuristics code I've written so far committed before it becomes anymore unwieldy.
We use v8 in the browser process for processing proxy pac files. Have you considered using v8 for these patterns as well? If we control all the patterns, then it is probably safe to add another pattern engine (the patterns seem distinct enough from the existing pattern engines), but putting it in the renderer would help to sandbox this (both from blocking the UI and from crashes).
http://codereview.chromium.org/439008/diff/1/2 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/1/2#newcode31 chrome/browser/autofill/pattern.h:31: static Term* Parse(const string16 &p); So, if this is to be used by code, why do you need a parsing function?
http://codereview.chromium.org/439008/diff/1/2 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/1/2#newcode22 chrome/browser/autofill/pattern.h:22: // a|b : match either a or b On 2009/11/23 23:16:00, Nico wrote: > Can you add a real BNF here to make clear how these can be combined? Done. Can you check it out? It's been a while since I've written a grammar out.
A few more stupid questions :-) http://codereview.chromium.org/439008/diff/2003/2004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> s/basic-RE/string/ ? http://codereview.chromium.org/439008/diff/2003/2004#newcode23 chrome/browser/autofill/pattern.h:23: // <char> ::= any ASCII character What about unicode? http://codereview.chromium.org/439008/diff/2003/2005 File chrome/chrome.gyp (right): http://codereview.chromium.org/439008/diff/2003/2005#newcode854 chrome/chrome.gyp:854: 'browser/autofill/address_view.h', Did you forget to add these to the CL or will you remove them before committing? (have you heard of git? :-P)
http://codereview.chromium.org/439008/diff/1/2 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/1/2#newcode31 chrome/browser/autofill/pattern.h:31: static Term* Parse(const string16 &p); On 2009/11/24 01:03:50, Nico wrote: > So, if this is to be used by code, why do you need a parsing function? I'm not sure I understand the question. The user of the API creates a Pattern object, passing in the regex pattern, and then the user calls Match() for each string he wants to match the pattern against. Pattern pattern("last name|surname"); if (pattern.match(field1->name()) || pattern.match(field2->name()) { ... }
On Mon, Nov 23, 2009 at 5:09 PM, <jhawkins@chromium.org> wrote: > > http://codereview.chromium.org/439008/diff/1/2 > > File chrome/browser/autofill/pattern.h (right): > > http://codereview.chromium.org/439008/diff/1/2#newcode31 > chrome/browser/autofill/pattern.h:31: static Term* Parse(const string16 > &p); > On 2009/11/24 01:03:50, Nico wrote: > >> So, if this is to be used by code, why do you need a parsing function? >> > > I'm not sure I understand the question. The user of the API creates a > Pattern object, passing in the regex pattern, and then the user calls > Match() for each string he wants to match the pattern against. > > Pattern pattern("last name|surname"); > Why wouldn't the caller do Pattern pattern(OrPattern(LiteralPattern("last name"), LiteralPattern("surname)); ? > if (pattern.match(field1->name()) || > pattern.match(field2->name()) { > ... > > } > > http://codereview.chromium.org/439008 >
http://codereview.chromium.org/439008/diff/2003/2004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> On 2009/11/24 01:08:34, Nico wrote: > s/basic-RE/string/ ? The idea is that '&' and '^' can only prefix a string (as in you can't combine them like '&^apple'), while the '@' modifier can be comibined with either one of these symbols. So '@&apple' will match any string containing a word beginning with 'apple', but only in web page text. http://codereview.chromium.org/439008/diff/2003/2004#newcode23 chrome/browser/autofill/pattern.h:23: // <char> ::= any ASCII character On 2009/11/24 01:08:34, Nico wrote: > What about unicode? Yes, this should be unicode, not ASCII. http://codereview.chromium.org/439008/diff/2003/2005 File chrome/chrome.gyp (right): http://codereview.chromium.org/439008/diff/2003/2005#newcode854 chrome/chrome.gyp:854: 'browser/autofill/address_view.h', On 2009/11/24 01:08:34, Nico wrote: > Did you forget to add these to the CL or will you remove them before committing? > > (have you heard of git? :-P) Forgot to take them back out. I use git for other projects, but I've tried it many times for chrome development, and it just never really worked for me :-/
http://codereview.chromium.org/439008/diff/2003/2004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> On 2009/11/24 01:18:13, James Hawkins wrote: > On 2009/11/24 01:08:34, Nico wrote: > > s/basic-RE/string/ ? > > The idea is that '&' and '^' can only prefix a string (as in you can't combine > them like '&^apple'), while the '@' modifier can be comibined with either one of > these symbols. So '@&apple' will match any string containing a word beginning > with 'apple', but only in web page text. Oh, you mean "simple-RE" then. "basic-RE" isn't defined anywhere.
http://codereview.chromium.org/439008/diff/2003/2004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> On 2009/11/24 01:20:07, Nico wrote: > On 2009/11/24 01:18:13, James Hawkins wrote: > > On 2009/11/24 01:08:34, Nico wrote: > > > s/basic-RE/string/ ? > > > > The idea is that '&' and '^' can only prefix a string (as in you can't combine > > them like '&^apple'), while the '@' modifier can be comibined with either one > of > > these symbols. So '@&apple' will match any string containing a word beginning > > with 'apple', but only in web page text. > > Oh, you mean "simple-RE" then. "basic-RE" isn't defined anywhere. Also, as is this allows things like "@@@@@@@@@foo" which is probably not intended.
On 2009/11/24 01:13:21, Nico wrote: > On Mon, Nov 23, 2009 at 5:09 PM, <mailto:jhawkins@chromium.org> wrote: > > > > > http://codereview.chromium.org/439008/diff/1/2 > > > > File chrome/browser/autofill/pattern.h (right): > > > > http://codereview.chromium.org/439008/diff/1/2#newcode31 > > chrome/browser/autofill/pattern.h:31: static Term* Parse(const string16 > > &p); > > On 2009/11/24 01:03:50, Nico wrote: > > > >> So, if this is to be used by code, why do you need a parsing function? > >> > > > > I'm not sure I understand the question. The user of the API creates a > > Pattern object, passing in the regex pattern, and then the user calls > > Match() for each string he wants to match the pattern against. > > > > Pattern pattern("last name|surname"); > > > > Why wouldn't the caller do > > Pattern pattern(OrPattern(LiteralPattern("last name"), > LiteralPattern("surname)); > > ? > a) Doesn't scale as well. b) Not very readable. c) Convenience of the caller. Imagine: Pattern pattern(OrPattern(LiteralPattern("^hello"), OrPattern(LiteralPattern("&begin"), OrPattern(LiteralPattern("@webtext"), LiteralPattern("anytext"))))); as opposed to Pattern pattern("^hello|&begin|@webtext|anytext");
http://codereview.chromium.org/439008/diff/2003/2004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> On 2009/11/24 01:20:07, Nico wrote: > On 2009/11/24 01:18:13, James Hawkins wrote: > > On 2009/11/24 01:08:34, Nico wrote: > > > s/basic-RE/string/ ? > > > > The idea is that '&' and '^' can only prefix a string (as in you can't combine > > them like '&^apple'), while the '@' modifier can be comibined with either one > of > > these symbols. So '@&apple' will match any string containing a word beginning > > with 'apple', but only in web page text. > > Oh, you mean "simple-RE" then. "basic-RE" isn't defined anywhere. Yes, fixed. Thanks. http://codereview.chromium.org/439008/diff/2003/2004#newcode21 chrome/browser/autofill/pattern.h:21: // <webstring> ::= "@" <basic-RE> On 2009/11/24 01:21:14, Nico wrote: > On 2009/11/24 01:20:07, Nico wrote: > > On 2009/11/24 01:18:13, James Hawkins wrote: > > > On 2009/11/24 01:08:34, Nico wrote: > > > > s/basic-RE/string/ ? > > > > > > The idea is that '&' and '^' can only prefix a string (as in you can't > combine > > > them like '&^apple'), while the '@' modifier can be comibined with either > one > > of > > > these symbols. So '@&apple' will match any string containing a word > beginning > > > with 'apple', but only in web page text. > > > > Oh, you mean "simple-RE" then. "basic-RE" isn't defined anywhere. > > Also, as is this allows things like "@@@@@@@@@foo" which is probably not > intended. Yea I was just thinking about that. I need to split out webstring.
On 2009/11/24 01:29:06, James Hawkins wrote: > > > > Also, as is this allows things like "@@@@@@@@@foo" which is probably not > > intended. > > Yea I was just thinking about that. I need to split out webstring. Patch set 3 has webstring split out.
> > Why wouldn't the caller do > > > > Pattern pattern(OrPattern(LiteralPattern("last name"), > > LiteralPattern("surname)); > > > > ? > > > > a) Doesn't scale as well. > b) Not very readable. > c) Convenience of the caller. > > Imagine: > > Pattern pattern(OrPattern(LiteralPattern("^hello"), > OrPattern(LiteralPattern("&begin"), OrPattern(LiteralPattern("@webtext"), > LiteralPattern("anytext"))))); > > as opposed to > > Pattern pattern("^hello|&begin|@webtext|anytext"); This is true, but 1.) I guess the calling code will not put literal strings there but compose the pattern of several sub-patterns. At that point, I'm not sure if string concatenation is cleaner or less code than building objects. 2.) This argument can be made for any class. If you think this is the way to go, I won't stop you, but I'm not entirely convinced (and using objects would also work around the syntax ambiguities I mention below). If you want to go with the parsing, maybe Parser and AST classes should be separate?
On 2009/11/24 01:47:10, Nico wrote: > > > Why wouldn't the caller do > > > > > > Pattern pattern(OrPattern(LiteralPattern("last name"), > > > LiteralPattern("surname)); > > > > > > ? > > > > > > > a) Doesn't scale as well. > > b) Not very readable. > > c) Convenience of the caller. > > > > Imagine: > > > > Pattern pattern(OrPattern(LiteralPattern("^hello"), > > OrPattern(LiteralPattern("&begin"), OrPattern(LiteralPattern("@webtext"), > > LiteralPattern("anytext"))))); > > > > as opposed to > > > > Pattern pattern("^hello|&begin|@webtext|anytext"); > > This is true, but > > 1.) I guess the calling code will not put literal strings there but compose the > pattern of several sub-patterns. At that point, I'm not sure if string > concatenation is cleaner or less code than building objects. I understand your concern, but this class has a limited use case, which is just autofill. I've kept it out of base for that reason. I just counted, and there are 33 instances where we have a pattern of the form "a|b|c" as a constant literal. No patterns are generated (composed), and certainly no external data is used (from the web page). This will also be backed up by unit tests eventually. I'm even willing to write them now, if that will convince you. > 2.) This argument can be made for any class. > > If you think this is the way to go, I won't stop you, but I'm not entirely > convinced (and using objects would also work around the syntax ambiguities I > mention below). > Which syntax ambiguities are you referring to? > If you want to go with the parsing, maybe Parser and AST classes should be > separate?
Below. Sorry, that mondrian doesn't include comments when hitting "reply" gets me all the time. http://codereview.chromium.org/439008/diff/7003/7004 File chrome/browser/autofill/pattern.h (right): http://codereview.chromium.org/439008/diff/7003/7004#newcode24 chrome/browser/autofill/pattern.h:24: // <char> ::= any unicode character So "@" is a literal string that matches "@", "@@" is a literal string that matches "@" only in web page text, "@^" matches "^" only in web page text, "@^a" matches "a" exactly in web page text?
It looks like it might be okay (note the double-hedge), but this is one of those cases in which I'd be much happier if it had a unit test, especially since then I could more easily download and apply the patch, and then try to make it blow up.
On 2009/11/24 03:35:53, viettrungluu wrote: > It looks like it might be okay (note the double-hedge), but this is one of those > cases in which I'd be much happier if it had a unit test, especially since then > I could more easily download and apply the patch, and then try to make it blow > up. Sure, I'm actually writing up unit tests right now.
On 2009/11/24 01:28:54, James Hawkins wrote: > On 2009/11/24 01:13:21, Nico wrote: > > On Mon, Nov 23, 2009 at 5:09 PM, <mailto:jhawkins@chromium.org> wrote: > > > > > > > > http://codereview.chromium.org/439008/diff/1/2 > > > > > > File chrome/browser/autofill/pattern.h (right): > > > > > > http://codereview.chromium.org/439008/diff/1/2#newcode31 > > > chrome/browser/autofill/pattern.h:31: static Term* Parse(const string16 > > > &p); > > > On 2009/11/24 01:03:50, Nico wrote: > > > > > >> So, if this is to be used by code, why do you need a parsing function? > > >> > > > > > > I'm not sure I understand the question. The user of the API creates a > > > Pattern object, passing in the regex pattern, and then the user calls > > > Match() for each string he wants to match the pattern against. > > > > > > Pattern pattern("last name|surname"); > > > > > > > Why wouldn't the caller do > > > > Pattern pattern(OrPattern(LiteralPattern("last name"), > > LiteralPattern("surname)); > > > > ? > > > > a) Doesn't scale as well. > b) Not very readable. > c) Convenience of the caller. > > Imagine: > > Pattern pattern(OrPattern(LiteralPattern("^hello"), > OrPattern(LiteralPattern("&begin"), OrPattern(LiteralPattern("@webtext"), > LiteralPattern("anytext"))))); > > as opposed to > > Pattern pattern("^hello|&begin|@webtext|anytext"); Not sure I'd agree that the solution to improve readability is to introduce a new regular expression language which uses "^" for something other than "beginning" (why not "=" for exact and "^" for beginning?). As demonstrated by gmock and boost, you can do a lot to improve syntax. You could assume literal strings are literals, and drop the duplicate uses of "Pattern" in the template expanders. Then the example could pretty reasonably be written as: Pattern pattern(Or("^hello",Or("&begin", Or("@webtext","anytext")))); Or have multiple initializers (or an overloaded Create function): Pattern pattern(Exact("hello"), Begin("begin"), Web("webtext"), "anytext"); Or if you were committed to the prefixes: Pattern pattern("^hello", "&begin", "@webtext", "anytext"); Or be inspired by gmock: Pattern pattern = Pattern("^hello").or("&begin").or("@webtext").or("anytext"); [I don't know whether to grin at that or toss my cookies, but there you are, probably more acceptable than overloading |().] -scott
I've added a WebKit API to access WebCore::RegularExpression, and I'm going to use that for autofill. Thanks for the reviews/comments. Closing this issue. |