|
|
Chromium Code Reviews|
Created:
3 years, 10 months ago by Alexei Svitkine (slow) Modified:
3 years, 10 months ago Reviewers:
Nico CC:
chromium-reviews Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionFix xmb_unittest.py in GRIT.
It was broken on systems with a "narrow" build of Python, see:
http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to-not-get-a-bad-character-range-error
It turns out the regex was not needed at all - as invalid characters do not even
reach that point. This removes the check and adds a unit test to verify that
the tool produces an error when characters outside the valid range are specified.
BUG=693086
Review-Url: https://codereview.chromium.org/2697383002
Cr-Commit-Position: refs/heads/master@{#451370}
Committed: https://chromium.googlesource.com/chromium/src/+/56dfab6710187ab2f41a917c1ce99529836bd6e7
Patch Set 1 #Patch Set 2 : comment change #Patch Set 3 : Remove regex and add a test. #Patch Set 4 : No need for tempfile. #Patch Set 5 : Correct sequence for \U00110000. #Messages
Total messages: 25 (11 generated)
Description was changed from ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python. BUG=693086 ========== to ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python. This replaces with equivalent regex syntax that was produced via (https://mothereff.in/regexpu) as suggested by: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... BUG=693086 ==========
asvitkine@chromium.org changed reviewers: + thakis@chromium.org
Description was changed from ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python. This replaces with equivalent regex syntax that was produced via (https://mothereff.in/regexpu) as suggested by: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... BUG=693086 ========== to ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python. This replaces with equivalent regex syntax that was produced via https://mothereff.in/regexpu as suggested by: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... BUG=693086 ==========
Thanks! Doesn't this break things on wide python builds, where each char is probably ucs-4 instead of utf-16 in memory?
On 2017/02/16 20:37:12, Nico wrote: > Thanks! Doesn't this break things on wide python builds, where each char is > probably ucs-4 instead of utf-16 in memory? Per the stack overflow, they say it should work with both versions. I patched this in on Linux and tested it and the tests pass - so I think it works. (Removing that part of the regex makes the tests fail so presumably they are testing it.)
lgtm since it helps. Do you understand why it works on wide builds though? I read the article you linked to several times (both before your comment and after) and I don't see where it says that this is supposed to work with both builds. From what I understand, in a narrow build, unicode strings are a sequence of char16_ts. In wide builds, it's a sequence of char32_ts. In a narrow build, a character outside of the BMP will be two python "characters" comprising a surrogate pair. In a wide build, it'll just be one regular character. After this patch, we're matching the utf-16 surrogate pair -- which isn't present in a python wide build unicode string from what I understand. So I don't understand why this works after this patch -- maybe our tests aren't exhaustive or something. I would've expected that we'd need the current regex in wide builds and the new one in narrow builds. (Maybe the python regex engine converts surrogate pairs in wide builds? I guess that's easy to check in the repl?)
Thanks for looking deeper. Re-reading the link, actually I agree with you - this shouldn't work in wide Python... And using their example also doesn't work in wide Python: >>> input = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000' >>> re.sub(u'\ud834[\udf00-\udf56]', '', input) u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000' I guess you're right that the tests don't have full coverage here. Sigh! I'll look more into it. On Thu, Feb 16, 2017 at 4:26 PM, <thakis@chromium.org> wrote: > lgtm since it helps. Do you understand why it works on wide builds though? > I > read the article you linked to several times (both before your comment and > after) and I don't see where it says that this is supposed to work with > both > builds. From what I understand, in a narrow build, unicode strings are a > sequence of char16_ts. In wide builds, it's a sequence of char32_ts. In a > narrow > build, a character outside of the BMP will be two python "characters" > comprising > a surrogate pair. In a wide build, it'll just be one regular character. > After > this patch, we're matching the utf-16 surrogate pair -- which isn't > present in a > python wide build unicode string from what I understand. So I don't > understand > why this works after this patch -- maybe our tests aren't exhaustive or > something. I would've expected that we'd need the current regex in wide > builds > and the new one in narrow builds. (Maybe the python regex engine converts > surrogate pairs in wide builds? I guess that's easy to check in the repl?) > > https://codereview.chromium.org/2697383002/ > -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
I don't get why it works on wide Python. The unit test does indeed specify a character within the relevant range: \U0001F60A I tried expanding them with more such characters within the range, but it accepted them all. So I'm not sure what I'm missing and what's different between this case and the case in the stackoverflow example which doesn't work.
Looking at this more closely, I think it's because of how the new syntax interacts with ^ in wide builds. The test is only testing that a string gets accepted, but there's no test coverage for the string being denied. So I think in wide builds, the regex ends up accepting everything. At least, I *think* that's what's going on. Let me expanding the test coverage to confirm this and if so revise the fix.
On 2017/02/17 17:41:23, Alexei Svitkine (slow) wrote: > Looking at this more closely, I think it's because of how the new syntax > interacts with ^ in wide builds. > > The test is only testing that a string gets accepted, but there's no test > coverage for the string being denied. So I think in wide builds, the regex ends > up accepting everything. > > At least, I *think* that's what's going on. Let me expanding the test coverage > to confirm this and if so revise the fix. .... And it turns out I can't get the exception based on that regexp to trigger at all! Any characters not covered by that regex end up being caught earlier on - e.g. via SAXParseException or via SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes. So I'm tempted to just remove the regexp and the code that checks it. All the tests will still pass. WDYT?
On 2017/02/17 18:16:43, Alexei Svitkine (slow) wrote: > On 2017/02/17 17:41:23, Alexei Svitkine (slow) wrote: > > Looking at this more closely, I think it's because of how the new syntax > > interacts with ^ in wide builds. > > > > The test is only testing that a string gets accepted, but there's no test > > coverage for the string being denied. So I think in wide builds, the regex > ends > > up accepting everything. > > > > At least, I *think* that's what's going on. Let me expanding the test coverage > > to confirm this and if so revise the fix. > > .... And it turns out I can't get the exception based on that regexp to trigger > at all! > > Any characters not covered by that regex end up being caught earlier on - e.g. > via SAXParseException or via SyntaxError: (unicode error) 'unicodeescape' codec > can't decode bytes. > > So I'm tempted to just remove the regexp and the code that checks it. All the > tests will still pass. WDYT? From what I understand, we try to do some verification so that we can predict if the XMB tool that actually does the translation won't fail much later (see description of https://codereview.chromium.org/1440313003/). If things still throw as-is, then I guess getting rid of this sounds fine to me.
PTAL. I removed the regexp and crafted a test that verified that an error is thrown for a valid outside of the supported range.
Description was changed from ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python. This replaces with equivalent regex syntax that was produced via https://mothereff.in/regexpu as suggested by: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... BUG=693086 ========== to ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python, see: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... It turns out the regex was not needed at all - as invalid characters do not even reach that point. This removes the check and adds a unit test to verify that the tool produces an error when characters outside the valid range are specified. BUG=693086 ==========
lgtm
The CQ bit was checked by asvitkine@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from thakis@chromium.org Link to the patchset: https://codereview.chromium.org/2697383002/#ps60001 (title: "No need for tempfile.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by asvitkine@chromium.org
The CQ bit was checked by asvitkine@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from thakis@chromium.org Link to the patchset: https://codereview.chromium.org/2697383002/#ps80001 (title: "Correct sequence for \U00110000.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch.
Bot data: {"patchset_id": 80001, "attempt_start_ts": 1487359315613430,
"parent_rev": "d4601dd9252054edfcd8bb4e4c2a93f11c307d85", "commit_rev":
"56dfab6710187ab2f41a917c1ce99529836bd6e7"}
Message was sent while issue was closed.
Description was changed from ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python, see: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... It turns out the regex was not needed at all - as invalid characters do not even reach that point. This removes the check and adds a unit test to verify that the tool produces an error when characters outside the valid range are specified. BUG=693086 ========== to ========== Fix xmb_unittest.py in GRIT. It was broken on systems with a "narrow" build of Python, see: http://stackoverflow.com/questions/31603075/how-can-i-represent-this-regex-to... It turns out the regex was not needed at all - as invalid characters do not even reach that point. This removes the check and adds a unit test to verify that the tool produces an error when characters outside the valid range are specified. BUG=693086 Review-Url: https://codereview.chromium.org/2697383002 Cr-Commit-Position: refs/heads/master@{#451370} Committed: https://chromium.googlesource.com/chromium/src/+/56dfab6710187ab2f41a917c1ce9... ==========
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://chromium.googlesource.com/chromium/src/+/56dfab6710187ab2f41a917c1ce9... |
