|
|
Created:
5 years, 5 months ago by jungshik at Google Modified:
4 years, 9 months ago CC:
chromium-reviews, cbentzel+watch_chromium.org, brentlondon_google.com, jshin+watch_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionImplement a new IDN display policy
The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:
1. Use moderately restrictive rules for script mixing [1] with additional
restrictions on mixing with Latin.
- Script mixing is only allowed with ASCII-Latin (instead
of any Latin) + another script allowed at the moderatate
restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
combining mark.
6. Turn off whole script confusable check. It'd block some common
domain labels like рф (IDN ccTLD for '.ru'),
'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
from 'script mixing restriction' and will catch cases like 'gօօgle'
with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
encoded in punycode/ACE are always shown in Punycode. This is to make
the display policy consistent with our prior decision to use UTS 46
'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.
Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is another difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing.
Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).
All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).
48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.
P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.
P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed.
[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
are blocked or mapped any way by other restrictions/mechanism in place.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix" bug 595263
BUG=336973, 595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
--gtest_filter=*Puny*
Committed: https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c
Cr-Commit-Position: refs/heads/master@{#382029}
Patch Set 1 #Patch Set 2 : use lazy instance, remove the old code #Patch Set 3 : a little bit reorg #Patch Set 4 : restore languages as dummy arg for now #Patch Set 5 : restore languages in one more function #Patch Set 6 : add back languages to one more, update comments #
Total comments: 15
Patch Set 7 : manual rebase after move to components #Patch Set 8 : format url test fix #Patch Set 9 : review comments addressed + alpha #Patch Set 10 : typo fix #
Total comments: 36
Patch Set 11 : rebased #Patch Set 12 : address review comments + fix some test (still WiP) #Patch Set 13 : whole_script check experiment #Patch Set 14 : back to disable wholescript conf. check + comment update #Patch Set 15 : switch to moderately restrictive + turn off wholescirpt confusable check #Patch Set 16 : fix typos, clean up the allowed set. #Patch Set 17 : comment clean up #Patch Set 18 : more tweaks + temporary log for testing #Patch Set 19 : U+30FB/FC tweak #Patch Set 20 : more tweaking for U+30FC/B handling #Patch Set 21 : dangerous check - thread safety #Patch Set 22 : deviation character check added #Patch Set 23 : add new IDN tests and drop special casing of U+30FC/B #Patch Set 24 : adjust two IDN test entries #
Total comments: 42
Patch Set 25 : url_canon test: wchar* needs a surrogate pair on *nix #
Total comments: 10
Patch Set 26 : review commetns addressed #Patch Set 27 : drop U+2027 and add tests for U+2027 and U+05F4 #
Total comments: 27
Patch Set 28 : use TLS and update comments #Patch Set 29 : more comment update per Peter #
Messages
Total messages: 82 (40 generated)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/60001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: cast_shell_linux on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linu...) ios_dbg_simulator_ninja on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_dbg_simulator...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/80001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_compile_dbg_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...)
rsleevi@chromium.org changed reviewers: + rsleevi@chromium.org
I don't know why I had these drafts, but uh, have some drafts? :) https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.cc File net/base/net_util_icu.cc (right): https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:33: class IDNSpoofChecker { Document https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:36: bool check(const base::char16* label, int label_len); naming nit: These should be called Check DANGER: using "int" for buffer length's is quite dangerous! Would a base::StringPiece16 work here (at least to control the danger). Alternatively, would size_t work with a checked cast? https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:36: bool check(const base::char16* label, int label_len); Document https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:46: class IDNSpoofCheckerExtra { Document https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:49: bool check(const base::char16* label, int label_len); Same comments as above https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:63: DCHECK(U_SUCCESS(status)) << "spoof checker failed to open with error: " Is DCHECK really appropriate? Does this indicate programmer error? Or fatal library failure? https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:137: DCHECK(U_SUCCESS(status)); Shouldn't you actually handle if this can fail?
Thank you for the comments. I just happened to come back to this CL and was about to rebase it (because IDN display moved to component/). I'll address your comments while rebasing. On 2015/08/28 00:35:15, Ryan Sleevi wrote: > I don't know why I had these drafts, but uh, have some drafts? :) > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.cc > File net/base/net_util_icu.cc (right): > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:33: class IDNSpoofChecker { > Document > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:36: bool check(const base::char16* label, int > label_len); > naming nit: These should be called Check > DANGER: using "int" for buffer length's is quite dangerous! Would a > base::StringPiece16 work here (at least to control the danger). Alternatively, > would size_t work with a checked cast? > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:36: bool check(const base::char16* label, int > label_len); > Document > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:46: class IDNSpoofCheckerExtra { > Document > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:49: bool check(const base::char16* label, int > label_len); > Same comments as above > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:63: DCHECK(U_SUCCESS(status)) << "spoof checker failed > to open with error: " > Is DCHECK really appropriate? Does this indicate programmer error? Or fatal > library failure? > > https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... > net/base/net_util_icu.cc:137: DCHECK(U_SUCCESS(status)); > Shouldn't you actually handle if this can fail?
Ryan, can you take another look? I'll also add a bunch of new IDNs to the unittest. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.cc File net/base/net_util_icu.cc (right): https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:33: class IDNSpoofChecker { On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Document Done. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:36: bool check(const base::char16* label, int label_len); On 2015/08/28 00:35:14, Ryan Sleevi wrote: > naming nit: These should be called Check > DANGER: using "int" for buffer length's is quite dangerous! Would a > base::StringPiece16 work here (at least to control the danger). Alternatively, > would size_t work with a checked cast? Done. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:36: bool check(const base::char16* label, int label_len); On 2015/08/28 00:35:14, Ryan Sleevi wrote: > naming nit: These should be called Check > DANGER: using "int" for buffer length's is quite dangerous! Would a > base::StringPiece16 work here (at least to control the danger). Alternatively, > would size_t work with a checked cast? I switched to StringPiece16 (I thought about it but didn't bother...). https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:36: bool check(const base::char16* label, int label_len); On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Document Done. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:46: class IDNSpoofCheckerExtra { On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Document Done. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:49: bool check(const base::char16* label, int label_len); On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Same comments as above Done. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:63: DCHECK(U_SUCCESS(status)) << "spoof checker failed to open with error: " On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Is DCHECK really appropriate? Does this indicate programmer error? Or fatal > library failure? This should never happen unless for some reason, ICU data is not available or ICU data is broken/malformed. Unfortunately, there are mysterious crash reports that could only happen with ICU data missing/unavailable/broken on Android / iOS (e.g. internal bug : b/23186531 ). If the ICU data is missing/malformed/.., we're hosed anyway and we can just use CHECK here. Alternatively, we can move along (disabling IDN spoof check and displaying all the IDNs in punycode) hoping that the rest of ICU is somehow all right. https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.... net/base/net_util_icu.cc:137: DCHECK(U_SUCCESS(status)); On 2015/08/28 00:35:15, Ryan Sleevi wrote: > Shouldn't you actually handle if this can fail? I agree. Done.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/180001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/180001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) win8_chromium_ng on tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win8_chromium_ng/...)
Sorry for the delay. I've taken another pass at reviewing - while I don't know the ICU APIs, I'm starting to learn from reviewing this code :) My high-level suggestion would be to consider documenting some of the policy and trickiness in a markdown file (perhaps in this same directory), given the Chromium Wiki shuttering. You could also do it on dev.chromium.org, although the markdown of course has the benefit that it's in the source tree. In particular, the discussion about "first pass" and "second pass" in the code suggest some non-trivialness (if you just read the comments), and it doesn't really make sense until you look at all of the code. Thus, it might help to provide a brief high-level overview of the policies, and then it's clear how the code implements them. It might even just make sense as a file-level comment - a .md file may be too heavyweight for how 'simple' it is - but that's the most significant bit of feedback I'd offer from reviewing this. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:239: // with the level of script mixing, allowed characters, and types of checks. Comment nit: I found this comment a bit hard to read. I've tried rewriting it, both to make sure I correctly understand and to offer feedback. // A helper class for the first pass of IDN spoof checking. I'm not clear what "level of script mixing, allowed characters, and types of checks" is meant to convey. As I read it, it suggests parameters / inputs, but perhaps you mean it as "The policies that the Chromium security and internationalization teams have decided on", in which case, it might be best set // A helper class for the first pass of IDN spoof checking, used to // ensure that no IDN input meets Chromium's standard of spoofability. // For a more thorough explanation of how spoof checking works in // Chromium, see [link to Markdown file explaining?] https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:247: // check is necessary. Comment nit: Suggested rewording // Returns true if |label| is safe to display as Unicode. In the event // of library failure, all IDN inputs will be treated as unsafe. In this reword, I try to remove the documentation about the implementation ("is done with |checker_|", "Besides, it internally calls"), since that should be both obvious from the code and fragile to document. Instead, it just spells out the API contract. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:261: // |USpoofChecker| in IDNSpoofChecker. // A helper class for the second pass of IDN spoof checking. // For a more thorough explanation of how spoof checking works in Chromium, // see [link to Markdown file explaining first and second passes] It's unclear to me why this is a separate helper object, as opposed to just being one spoof checker that does both passes, but I haven't read the rest of the code yet, I'm just basing it on "read this file and try to understand from header comments what will happen" https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:266: // in Unicode. Called by IDNSpoofChecker::Check. Comment nit: I would remove the second sentence - that's discussing implementation details of some 'other' implementation (even if it is in the same file) // Returns true if |label| is safe to display as Unicode according to // extended spoofability criteria. Because this extended criteria may // be expensive, it should only be used after first ensuring that |label| // is safe using IDNSpoofChecker::Check. Of course, the problem with the above suggested reword is you see how tightly coupled IDNSpoofChecker and IDNSpoofCheckerExtra are. If *Extra is an implementation detail of IDNSpoofChecker, then SpoofChecker shouldn't be talking about it. If it's the caller's responsibility to call both, then we should document that. But if IDNSpoofChecker is calling IDNSpoofCheckerExtra and that's the API contract, then the comments should be clearer to that point, such as // Implementation detail of IDNSpoofChecker. // As some checks for spoofability are expensive to compute, IDNSpoofCheckerExtra // implements the secondary checks after a given |label| has passed // the initial spoof checking. // This class should never be called directly - spoofability should // be checked with IDNSpoofChecker - and just exists to encapsulate the // more expensive logic. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:292: // The default is highly restrictive so that it's not set explicitly. I have trouble making sense of this sentence. It seems to suggest that the API call is unnecessary (because it's the default), but then you're setting it, so I'm not sure how to make sense of it. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:294: // using that, instead. Is there a bug open for this review? Just some sort of tracking bug for Chromium :) https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:301: // released and we update our copy of ICU. Comment nit: Avoid pronouns in comments ( https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/NH-S6KCkr2M ) // The recommended set and inclusion set come from ... // This list may change over time, and will be updated whenever the // included version of ICU is updated. (This avoids ambiguity for things like Chromium linking against system ICU, or non-Chromium distributions, etc - the "who is 'we'" question) https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:307: allowed_set.addAll(*inclusion_set); So you explain where the lists come from, but don't quite explain the policy here (as in, why is it OK to add these to the allowed sets). I suspect this is something that might be best in an .md or explainer. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:319: // U+1F600) ? Document and explain :P https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:327: // Hyphenation Point comment nit: "." at the end of Hyphenation Point (ending the sentence) comment nit: This reads a bit weird, the "not yet excluded from |allowed_set|". Up until this point, there's no exclusions from allowed_set, so it's kinda grammatically weird. A suggested reword might be something like // The following three characters are included in (recommended? inclusion) set, // but are blacklisted as part of Mozilla's IDN blacklist (link here). // Explicitly remove them, which does nothing if they're removed from // (recommended, inclusion?) set in a future ICU update. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:333: // them out as Mozilla does. Bug #? :) https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:348: // confusable characters for all letters in them. comment nit: Grammatically, this is weird because of the double "because" // Disable whole-script-confusable checks. // While this check would be desirable, simple strings such as // 'pax' (Latin) and 'b<u-umlaut>cher' (Latin?) are whole script confusable // with other scripts (such as Cyrillic/Greek). [Did the above capture the concerns?] I'm also not sure the "check against a list of well known good domain names" - is this talking about whitelisted TLDs? whitelisted strings? Something else? https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:386: "[^\\p{Katakana}\\p{Hiragana}\\p{Han}]" As written, doesn't this mean that a lone katakana no, so, zo, or n (without any context) will be allowed, since it'll fail the regex before/after? Or that "\\u0061\\u30ce" would be treated as safe (I'm probably butchering things here) https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:389: // mark, but we block a sequence of similary looking same comments about "we" https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:390: // Japanese combining marks as well. So, overall comment wise, this is somewhat hard to read because the comments are written for the positive case, but the actual regex is the negative (dangerous patterns). It may help to ensure the comments match the negativity. // Disallow the katakana no, so, zo, or n, as they may be mistaken // for slashes, unless they're entirely enclosed by Katakana, Hiragana, // or Han scripts. // Disallow repeating Japanese accent characters. Because // USPOOF_INVISIBLE will only check for repeated occurrences of the // same combining mark, it's necessary to block any sequence of // similar looking Japanese combining marks as well. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:397: // This is called only if script mixing is detected. Same comment about documenting 'how it's used' and such. This comment seems superfluous, but if you're trying to document a precondition, do it as such // |label| is a string that contains multiple, mixed scripts. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:399: // ASCII-Latin instead of any Latin. comment nit: s/ / / https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:411: // TODO(jshin): Check spoofing attempt against a list of 'good' domains BUG #? Is this domains (labels?) or TLDs?
Ping?
On 2015/12/08 01:39:27, Ryan Sleevi wrote: > Ping? yeah... it's been in my todo list. I'll update the list with new tests added tomorrow.
On 2015/12/10 01:05:58, jshin (jungshik at g) wrote: > On 2015/12/08 01:39:27, Ryan Sleevi wrote: > > Ping? > > yeah... it's been in my todo list. I'll update the list with new tests added > tomorrow. And, I'll add markdown and address your review comments (somehow I missed your 3-month old comments completely).
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended and inclusion sets from UAX 39 [2] 3. Allow 5 aspirational scripts from UAX 31 [3] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block 'pax' because p,a, or x has a Cyrillic look-alike. The same is true of 'bücher' (German) or 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=net_unittests --gtest_filter=*IDN, --gtest_filter=FormatUrl* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=FormatUrl* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=FormatUrl* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/320001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/320001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use Highly restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}. 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display policy except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (Long solidus overlay). P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display policy except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (Long solidus overlay). P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [5] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [5] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with an added restriction on Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
I saw you were tweaking this CL - did you mean to publish for review?
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' - Block about 40 non-Latin characters (that look similar to at least one of characters in Latin-ASCII from mixing with Latin-ASCII) 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 8. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level' 2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] 3. Allow 5 aspirational scripts from UAX 31 [4] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Block repeating Japanese combining marks (U+3099 - U+309C) even when different ones come in sequence. If the same one comes in a row, they'll be blocked by rule #5 above. Note that this is almost identical to Mozilla's IDN display algorithm [5] except for #7, #8 and two additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [6] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [5] https://wiki.mozilla.org/IDN_Display_Algorithm [6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [6] except for #7, #8, #9 and an additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [7] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] https://wiki.mozilla.org/IDN_Display_Algorithm [7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [6] except for #7, #8, #9 and an additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [7] P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] https://wiki.mozilla.org/IDN_Display_Algorithm [7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [6] except for #7, #8, #9 and an additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [7] xyz, 3 out of 25k in jp and none in ru and みんな TLDs. Most of domains filtered out in com TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of (0.2% of ~ 1 million). P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] https://wiki.mozilla.org/IDN_Display_Algorithm [7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction'. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [6] except for #7, #8, #9 and an additional restrictions in #1. Two other differences are that there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is still used atm.) and that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [7] xyz, 3 out of 25k in jp and none in ru and みんな TLDs. Most of domains filtered out in com TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of (0.2% of ~ 1 million). P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] https://wiki.mozilla.org/IDN_Display_Algorithm [7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, #9 and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, #9 and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] #9 is yet another difference because of Mozilla's use of UTS 46 'non-transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
jshin@chromium.org changed reviewers: + brettw@chromium.org, pkasting@google.com
Ryan and Peter, Can you take a look at url_formatter changes? Brett, I added one more test to url_canon. Can you take a look? To Ryan, below is my long over-due reply to your review comments. Some of them became obsolete because I've changed the CL a lot since. Sorry for the terribly late reply (I drafted but never published) and thanks a lot for your very careful review. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:239: // with the level of script mixing, allowed characters, and types of checks. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > Comment nit: I found this comment a bit hard to read. I've tried rewriting it, > both to make sure I correctly understand and to offer feedback. > > // A helper class for the first pass of IDN spoof checking. > > I'm not clear what "level of script mixing, allowed characters, and types of > checks" is meant to convey. As I read it, it suggests parameters / inputs, but > perhaps you mean it as "The policies that the Chromium security and > internationalization teams have decided on", in which case, it might be best set > > // A helper class for the first pass of IDN spoof checking, used to > // ensure that no IDN input meets Chromium's standard of spoofability. > // For a more thorough explanation of how spoof checking works in > // Chromium, see [link to Markdown file explaining?] Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:247: // check is necessary. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > Comment nit: Suggested rewording > > // Returns true if |label| is safe to display as Unicode. In the event > // of library failure, all IDN inputs will be treated as unsafe. > > > In this reword, I try to remove the documentation about the implementation ("is > done with |checker_|", "Besides, it internally calls"), since that should be > both obvious from the code and fragile to document. Instead, it just spells out > the API contract. Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:261: // |USpoofChecker| in IDNSpoofChecker. On 2015/09/08 21:45:59, Ryan Sleevi wrote: > // A helper class for the second pass of IDN spoof checking. > // For a more thorough explanation of how spoof checking works in Chromium, > // see [link to Markdown file explaining first and second passes] > > It's unclear to me why this is a separate helper object, as opposed to just > being one spoof checker that does both passes, but I haven't read the rest of > the code yet, I'm just basing it on "read this file and try to understand from > header comments what will happen" I removed this extra class and fold that into the first class. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:266: // in Unicode. Called by IDNSpoofChecker::Check. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > Comment nit: I would remove the second sentence - that's discussing > implementation details of some 'other' implementation (even if it is in the same > file) > > // Returns true if |label| is safe to display as Unicode according to > // extended spoofability criteria. Because this extended criteria may > // be expensive, it should only be used after first ensuring that |label| > // is safe using IDNSpoofChecker::Check. > > > Of course, the problem with the above suggested reword is you see how tightly > coupled IDNSpoofChecker and IDNSpoofCheckerExtra are. If *Extra is an > implementation detail of IDNSpoofChecker, then SpoofChecker shouldn't be talking > about it. If it's the caller's responsibility to call both, then we should > document that. But if IDNSpoofChecker is calling IDNSpoofCheckerExtra and that's > the API contract, then the comments should be clearer to that point, such as > > // Implementation detail of IDNSpoofChecker. > // As some checks for spoofability are expensive to compute, > IDNSpoofCheckerExtra > // implements the secondary checks after a given |label| has passed > // the initial spoof checking. > // This class should never be called directly - spoofability should > // be checked with IDNSpoofChecker - and just exists to encapsulate the > // more expensive logic. Initially, I expected ExtraCheck to have more expensive logic (such as checking against known good domains, etc) than what id has now, but I didn't do that (at least in this CL) so that folding ExtraCheck into IDNSpoofChecker is simpler. There's a bit more initialization to do for extra check, but it's not expensive. When we have more expensive logic, we can separate that out. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:292: // The default is highly restrictive so that it's not set explicitly. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > I have trouble making sense of this sentence. It seems to suggest that the API > call is unnecessary (because it's the default), but then you're setting it, so > I'm not sure how to make sense of it. Sorry for the confusion. (yeah, it's confusing.). Anyway, I switched to moderately restrictive and the comment was rewritten accordingly. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:294: // using that, instead. On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > Is there a bug open for this review? Just some sort of tracking bug for Chromium > :) By switching to moderately restrictive, no more need for TODO. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:301: // released and we update our copy of ICU. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > Comment nit: Avoid pronouns in comments ( > https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/NH-S6KCkr2M > ) > > // The recommended set and inclusion set come from ... > // This list may change over time, and will be updated whenever the > // included version of ICU is updated. > > (This avoids ambiguity for things like Chromium linking against system ICU, or > non-Chromium distributions, etc - the "who is 'we'" question) Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:307: allowed_set.addAll(*inclusion_set); On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > So you explain where the lists come from, but don't quite explain the policy > here (as in, why is it OK to add these to the allowed sets). > > I suspect this is something that might be best in an .md or explainer. Made the comment more self-explanatory. A separate document will have more details. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:319: // U+1F600) ? On 2015/09/08 21:45:58, Ryan Sleevi wrote: > Document and explain :P Yeah... they're rather arbitrary (they come from ...) . I just nixed them for now. I need to give more though on what to do with Emoji and other symbols. (it'd be great if we can just pass them all (most of them) and have an entirely different layer of defense against spoofing such as safe-browsing that can do a lot more off-line processing with a lot more data). https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:327: // Hyphenation Point On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > comment nit: "." at the end of Hyphenation Point (ending the sentence) > > comment nit: This reads a bit weird, the "not yet excluded from |allowed_set|". > Up until this point, there's no exclusions from allowed_set, so it's kinda > grammatically weird. > > A suggested reword might be something like > > // The following three characters are included in (recommended? inclusion) set, > // but are blacklisted as part of Mozilla's IDN blacklist (link here). > // Explicitly remove them, which does nothing if they're removed from > // (recommended, inclusion?) set in a future ICU update. I reviewed those 3 characters again and they can be safely allowed except for U+0338 which is a bit problematic when rendered with a broken font/text renering engine because it can be mistaken for a forward slash. (with the correct rendering, it should not). I'm dropping only that character. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:333: // them out as Mozilla does. On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > Bug #? :) Not necessary. They'd better be kept out. (per both uts46 and idna2008) Removed the comment. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:348: // confusable characters for all letters in them. On 2015/09/08 21:45:59, Ryan Sleevi (Slow to 2-21) wrote: > comment nit: Grammatically, this is weird because of the double "because" > > // Disable whole-script-confusable checks. > // While this check would be desirable, simple strings such as > // 'pax' (Latin) and 'b<u-umlaut>cher' (Latin?) are whole script confusable > // with other scripts (such as Cyrillic/Greek). > > [Did the above capture the concerns?] > > I'm also not sure the "check against a list of well known good domain names" - > is this talking about whitelisted TLDs? whitelisted strings? Something else? Rewrote the comment with more details on a possible alternative. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:386: "[^\\p{Katakana}\\p{Hiragana}\\p{Han}]" On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > As written, doesn't this mean that a lone katakana no, so, zo, or n (without any > context) will be allowed, since it'll fail the regex before/after? > > Or that > > "\\u0061\\u30ce" would be treated as safe (I'm probably butchering things here) Thank you for catching it. Rewrote the regex. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:389: // mark, but we block a sequence of similary looking On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > same comments about "we" Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:390: // Japanese combining marks as well. On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > So, overall comment wise, this is somewhat hard to read because the comments are > written for the positive case, but the actual regex is the negative (dangerous > patterns). > > It may help to ensure the comments match the negativity. > > // Disallow the katakana no, so, zo, or n, as they may be mistaken > // for slashes, unless they're entirely enclosed by Katakana, Hiragana, > // or Han scripts. > > // Disallow repeating Japanese accent characters. Because > // USPOOF_INVISIBLE will only check for repeated occurrences of the > // same combining mark, it's necessary to block any sequence of > // similar looking Japanese combining marks as well. Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:397: // This is called only if script mixing is detected. On 2015/09/08 21:45:59, Ryan Sleevi wrote: > Same comment about documenting 'how it's used' and such. This comment seems > superfluous, but if you're trying to document a precondition, do it as such > > > // |label| is a string that contains multiple, mixed scripts. Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:399: // ASCII-Latin instead of any Latin. On 2015/09/08 21:45:58, Ryan Sleevi wrote: > comment nit: s/ / / Done. https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:411: // TODO(jshin): Check spoofing attempt against a list of 'good' domains On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote: > BUG #? Is this domains (labels?) or TLDs? What I had in mind is now documented in another part of the code as a possible alternative with more details.
On 2016/02/25 22:52:02, Ryan Sleevi wrote: > I saw you were tweaking this CL - did you mean to publish for review? Sorry for the late reply. I've just done it.Testing against the list of IDNs from various sources has been 'fun'. :-)
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] #9 is yet another difference because of Mozilla's use of UTS 46 'non-transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] #9 is yet another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/460001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/460001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/480001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/480001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
pkasting@chromium.org changed reviewers: + pkasting@chromium.org
I'm not competent to do a functional review, so this is basically all tiny comment and style nitpicks. https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro... File components/omnibox/browser/history_url_provider_unittest.cc (left): https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro... components/omnibox/browser/history_url_provider_unittest.cc:888: arraysize(expected_true)); Is it possible to replace these tests with something equivalent? Specifically, some sort of punycoded URL that will not be rendered as Unicode, so we can verify inline autocompletion is allowed to work there. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:56: explicit HostComponentTransform() {} Nit: No explicit https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:194: // TODO(brettw) We may want to skip this step in the case of file URLs to Nit: ':' after (brettw) https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:242: // see http://dev.chromium.org/developers/design-documents/idn-in-google-chrome Nit: Add trailing " ." (Space before period ensures link is not corrupted) Also applies in comments below that end with a URL. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:246: // Returns true if |label| is safe to display as Unicode. In the event Nit: Blank line above this https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:247: // of library failure, all IDN inputs will be treated as unsafe. Nit: Wrap comments as close to 80 cols as possible (several places) https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:255: void SetAllowedUnicodeSet(UErrorCode* status); Nit: Member functions before member variables (and blank line between) https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:256: DISALLOW_COPY_AND_ASSIGN(IDNSpoofChecker); Nit: Blank line above this https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:272: << " ; all IDN will be shown in punycode."; Nit: In general prefer to avoid logging, unless you have a concrete plan for collecting and using this data https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:279: // except for USPOOF_CHAR_LIMIT. Nit: I might put the "except for" phrase before the parenthesized list for clarity. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:291: // Restrict allowed characters in IDN labels and turn on USPOOF_CHAR_LIMIT. Wait, I thought we just said we didn't turn this on? https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:296: checks |= USPOOF_AUX_INFO; Nit: Simpler: int32_t checks = uspoof_getChecks(checker_, &status) | USPOOF_AUX_INFO; https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:304: // be done on the server-side (e.g. SafeBrowsing). Nit: . -> : https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:322: UNICODE_STRING_SIMPLE("[:Latin:]"), status); Nit: Prefer to linebreak after '=' (2 places) https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:325: // Latin letters outside the ASCII. 'Script_Extensions=Latin' is Nit: Remove "the" https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:337: int32_t results = uspoof_check(checker_, label.data(), Nit: |result| would be a little more typical. Not sure if |check_result| would be any clearer. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:340: VLOG(5) << base::UTF16ToUTF8(label) << ":0x" << std::setfill('0') Why 5 and not the more common 1? (several places) https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:357: // shown in 'fu<sharp-s>' while 'fu<sharp-s>' typed or copy'n'pasted in Nit: in -> as; copy'n'pasted -> copy-and-pasted https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:464: // and U+2027 (Hyphenation Point) have little risk and are allowed. Do we have a bug on file with Mozilla to allow these, in order to align behavior? https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... File components/url_formatter/url_formatter.h (right): https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.h:58: // used any more and will be removed. |format_type| is a bitmask I assume you're doing these sorts of changes in a followup to avoid bloating this change with a bunch of mechanical changes? https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest.cc File url/url_canon_unittest.cc (right): https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest... url/url_canon_unittest.cc:405: // Maps U+1D68C(Math Monospace Small C) to 'c'. Nit: Space before '('
I took a quick run at this before having to dash, but overall looks much better - and is much clearer to understand! What I have left right now are nits, but I'll take a more thorough dive before/on Monday. With regards to divergence from Mozilla w/r/t #7/#8, do we want to reach out to them and see if they'd be interested in aligning to the same logic as you're proposing? Seems a low-friction thing we could do in parallel. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:246: // Returns true if |label| is safe to display as Unicode. In the event nit: a newline between 245 and 246 (maybe it's easier in an editor with syntax coloring, but was visually hard to distinguish here) https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:255: void SetAllowedUnicodeSet(UErrorCode* status); STYLE: This method should appear between lines 250 and 251 (per https://google.github.io/styleguide/cppguide.html#Declaration_Order ) - that is, methods before members. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:342: VLOG_IF(5, results & USPOOF_INVISIBLE) << base::UTF16ToUTF8(label) nit: Is it necessary to convert label here the second time? Given that both are VLOG(5)'d, the preceeding line would have had the label anyways. Also, wouldn't it have had results (line 341) - so is it necessary to log this invisibility check failed message? I know it's ultimately pedantic, but since it's VLOG, it'll be in release builds, but it seems unnecessary given 340/341. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:365: << " deviation character in punycode"; necessary? https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:560: << info.errors; Is this needed?
Thanks a lot for the review, Peter and Ryan. I'll address them tomorrow. (I'm ooo today). As for VLOG, actually I can get rid all of them (or convert them to DVLOG). I was on the fence as to whether or not to keep them (I used them during my test with millions of IDNs from com and other TLDs). When I convert them to DVLOG, I need to include <iomanip> only in debug build. Is it ok to do that? As for reaching out to Mozilla, I'll do. Some of differences can be almost automatically resolved if they switch to using ICU's confusable api (they already do for ToUnicode / ToASCII conversion for IDN)
On 2016/03/14 17:36:54, jshin (jungshik at google) wrote: > As for VLOG, actually I can get rid all of them (or convert them to DVLOG). I > was on the fence as to whether or not to keep them (I used them during my test > with millions of IDNs from com and other TLDs). > When I convert them to DVLOG, I need to include <iomanip> only in debug build. > Is it ok to do that? I have no strong opinion on DVLOG versus nothing. If you need to conditionally include, I believe the right macro to check is ENABLE_DLOG.
Thanks a lot for the review. I addressed review comments. In addition, I reexamined 'blacklist chars' and dropped U+2027 (Hyphenation Point) even though it does not appear in any of IDN domains I tested for several TLDs (com, jp, xyz and a few more). I was mistaken that it'd be blocked by other means, but it's not the case and it can be potentially risky because "<Katakan> U+2027 <Katakan> " can be mistaken for "<Katakana> U+30FB <Katakana>" U+05F4 (that looks like a double quotation mark) is kept for now because there are checks blocking it when used with a non-Hebrew script. The only potentially risky(?) case is when it's used by itself, but even that does not come across as a big deal. BTW, I filed a bug at bugzilla.mozilla.org about these characters: https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 As for other differences, #9 is not easy to resolve (it's not directly related to this change.) . Mozilla does want to go all the way to IDNA 2008 (UTS 46 non-transitional processing) while Chrome (and IE) want to continue to use UTS 46 transitional processing. https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro... File components/omnibox/browser/history_url_provider_unittest.cc (left): https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro... components/omnibox/browser/history_url_provider_unittest.cc:888: arraysize(expected_true)); On 2016/03/12 01:25:07, Peter Kasting wrote: > Is it possible to replace these tests with something equivalent? Specifically, > some sort of punycoded URL that will not be rendered as Unicode, so we can > verify inline autocompletion is allowed to work there. Yup. It's possible. I'm gonna use one of punycode URLs that always have to be shown in Punycode. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:56: explicit HostComponentTransform() {} On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: No explicit Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:194: // TODO(brettw) We may want to skip this step in the case of file URLs to On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: ':' after (brettw) Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:242: // see http://dev.chromium.org/developers/design-documents/idn-in-google-chrome On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: Add trailing " ." (Space before period ensures link is not corrupted) > > Also applies in comments below that end with a URL. Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:246: // Returns true if |label| is safe to display as Unicode. In the event On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: Blank line above this Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:247: // of library failure, all IDN inputs will be treated as unsafe. On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: Wrap comments as close to 80 cols as possible (several places) Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:255: void SetAllowedUnicodeSet(UErrorCode* status); On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: Member functions before member variables (and blank line between) Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:256: DISALLOW_COPY_AND_ASSIGN(IDNSpoofChecker); On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: Blank line above this Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:272: << " ; all IDN will be shown in punycode."; On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: In general prefer to avoid logging, unless you have a concrete plan for > collecting and using this data I added this LOG statement because in the past there was a mysterious failure to open a related function for IDN handling and having an error name from u_errorName was handy. The most likely cause was the ICU data not being loaded on Windows/Android. Because that will be logged elsewhere in ICU loading code (more was added recently by scottmg). I'll just get rid of LOG here. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:279: // except for USPOOF_CHAR_LIMIT. On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: I might put the "except for" phrase before the parenthesized list for > clarity. Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:291: // Restrict allowed characters in IDN labels and turn on USPOOF_CHAR_LIMIT. On 2016/03/12 01:25:07, Peter Kasting wrote: > Wait, I thought we just said we didn't turn this on? It's off by default and we're turning this on here. Let me reword a paragraph starting with "By default, USpoofChecker is opened with ..." to clarify. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:296: checks |= USPOOF_AUX_INFO; On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: Simpler: > > int32_t checks = uspoof_getChecks(checker_, &status) | USPOOF_AUX_INFO; Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:304: // be done on the server-side (e.g. SafeBrowsing). On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: . -> : Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:322: UNICODE_STRING_SIMPLE("[:Latin:]"), status); On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: Prefer to linebreak after '=' (2 places) Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:325: // Latin letters outside the ASCII. 'Script_Extensions=Latin' is On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: Remove "the" Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:337: int32_t results = uspoof_check(checker_, label.data(), On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: |result| would be a little more typical. Not sure if |check_result| would > be any clearer. Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:340: VLOG(5) << base::UTF16ToUTF8(label) << ":0x" << std::setfill('0') On 2016/03/12 01:25:07, Peter Kasting wrote: > Why 5 and not the more common 1? (several places) Not any particular reason. I just dropped all the logs. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:357: // shown in 'fu<sharp-s>' while 'fu<sharp-s>' typed or copy'n'pasted in On 2016/03/12 01:25:07, Peter Kasting wrote: > Nit: in -> as; copy'n'pasted -> copy-and-pasted Done. https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.cc:464: // and U+2027 (Hyphenation Point) have little risk and are allowed. On 2016/03/12 01:25:07, Peter Kasting wrote: > Do we have a bug on file with Mozilla to allow these, in order to align > behavior? I'll file tonight. (I started but I have to run). https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... File components/url_formatter/url_formatter.h (right): https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt... components/url_formatter/url_formatter.h:58: // used any more and will be removed. |format_type| is a bitmask On 2016/03/12 01:25:08, Peter Kasting wrote: > I assume you're doing these sorts of changes in a followup to avoid bloating > this change with a bunch of mechanical changes? Yes, that's right. (I wrote that to that effect in the CL description :-)). https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest.cc File url/url_canon_unittest.cc (right): https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest... url/url_canon_unittest.cc:405: // Maps U+1D68C(Math Monospace Small C) to 'c'. On 2016/03/12 01:25:08, Peter Kasting wrote: > Nit: Space before '(' Done. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:246: // Returns true if |label| is safe to display as Unicode. In the event On 2016/03/12 01:28:13, Ryan Sleevi wrote: > nit: a newline between 245 and 246 (maybe it's easier in an editor with syntax > coloring, but was visually hard to distinguish here) Done. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:255: void SetAllowedUnicodeSet(UErrorCode* status); On 2016/03/12 01:28:13, Ryan Sleevi wrote: > STYLE: This method should appear between lines 250 and 251 (per > https://google.github.io/styleguide/cppguide.html#Declaration_Order ) - that is, > methods before members. Done. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:342: VLOG_IF(5, results & USPOOF_INVISIBLE) << base::UTF16ToUTF8(label) On 2016/03/12 01:28:13, Ryan Sleevi wrote: > nit: Is it necessary to convert label here the second time? Given that both are > VLOG(5)'d, the preceeding line would have had the label anyways. Also, wouldn't > it have had results (line 341) - so is it necessary to log this invisibility > check failed message? > > I know it's ultimately pedantic, but since it's VLOG, it'll be in release > builds, but it seems unnecessary given 340/341. Yup. I deleted it. I used it as a quicker way to spot it during the development, but don't need it any more. https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:365: << " deviation character in punycode"; On 2016/03/12 01:28:13, Ryan Sleevi wrote: > necessary? Deleted it. Not necessary any more. Moreover, I can quickly find those 4 characters in my test set (collected from multiple TLDs even without running this test). https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt... components/url_formatter/url_formatter.cc:560: << info.errors; On 2016/03/12 01:28:13, Ryan Sleevi wrote: > Is this needed? Deleting it because it's only for my debugging during the development.
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. Another difference is that Mozilla's character black list is not used except for one character (U+0338 Long solidus overlay). [8] #9 is yet another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/520001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/520001
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This should fix https://bugs.chromium.org/p/chromium/issues/detail?id=282677#c6 BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This should fix https://bugs.chromium.org/p/chromium/issues/detail?id=282677#c6 BUG=336973 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to fix bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Filed a mozilla bug on making two browsers aligned as closely as possible: https://bugzilla.mozilla.org/show_bug.cgi?id=1257275
On 2016/03/16 18:34:03, jshin (jungshik at google) wrote: > Filed a mozilla bug on making two browsers aligned as closely as possible: > https://bugzilla.mozilla.org/show_bug.cgi?id=1257275 And another one: https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 Anyway, can you take another look at the latest PS? Thanks
Like Peter, I struggle with the functional review, but I'm trying to do my best to review the discussions on IDNA as best as possible. While I have more comment nits, I do have one question regarding design/performance and understanding the tradeoffs. Overall, this LG to me, and matches as best I can tell the available documentation, and I think you've done an admirable job in trying to build consensus. So I think with these few last things addressed, I'm happy to give the review. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:264: LAZY_INSTANCE_INITIALIZER; DESIGN: Do we run the risk of lock contention here? If so, we could always consider moving g_dangerous_pattern (which I understand can only check one pattern at a time) into a thread-local storage bucket. That makes a tradeoff of memory vs CPU, but it seems like it would amortize as a win, since there will only be a few threads that invoke this routine (ideally, only the IO thread, but ChromeCast & chromoting et all make that weird) https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:311: // transitional processing treat them as IDNA 2003 does; maps U+00DF and s/treat/treats/, since transitional processing is singular https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:337: // If uspoof_check fails or any of check is flagged, treat any IDN as comment nit: "any of check is flagged" doesn't read right. Is this meant to be "if any checks are flagged" ? That might also read weird, so how does // If uspoof_check fails (due to library failure), or if any of the checks fail, treat the IDN as unsafe. Is that the same comment? https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown in 'fu<sharp-s>' while 'fu<sharp-s>' typed s/shown in/shown as/ ? https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:350: // or copy and pasted as Unicode would be canonicalized to 'fuss'. This I have trouble understanding why this is (that is, canonicalized to 'fuss'), and am trying to understand if this is documenting the behaviour of the layers below it (//net and GURL) or the consumption at the layers above it (omnibox). Could you explain a little more via CL comment, and then we can see if any changes to the code comment are possible/worth it? https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:418: // http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts and the "and" appears to be an incomplete comment? https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:420: // characters are excluded. Instead, use characters listed with Status/Type comment nit: Because some characters of the aspirational scripts are excluded [from what?], it's not possible to add the aspirational scripts themselves. The goal is trying to avoid the "we", and also trying to get clarification as to what they're "excluded" from. Overall, this comment is "Not all characters in the script are included because some are excluded", but that's missing a few subjects (included in what, excluded from what) https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:424: // current version is 8.0.0. Is there a way to guard this with a compile-time flag, such as indicating what version of ICU is being used? In this way, any time an ICU uprev happens, the person performing the uprev re-evaluates this code and updates the compile-time guard as appropriate. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:444: // U+0338 is included in the recommended set while U+05F4 and U+2027 are in s/set while/set, while/ https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be confused with U+30FB (Katakana Middle Dot). s/font and/font, and/
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to fix bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to fix bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to fix bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to "fix" bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ==========
LGTM https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro... File components/omnibox/browser/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro... components/omnibox/browser/history_url_provider_unittest.cc:856: // A URL that matches due to a match in the punycode URL are allowed to be the Nit: are -> is https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown in 'fu<sharp-s>' while 'fu<sharp-s>' typed On 2016/03/16 20:30:49, Ryan Sleevi wrote: > s/shown in/shown as/ ? Yes, that was a comment I made last time that didn't actually get addressed. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:420: // characters are excluded. Instead, use characters listed with Status/Type On 2016/03/16 20:30:49, Ryan Sleevi wrote: > comment nit: Because some characters of the aspirational scripts are excluded > [from what?], it's not possible to add the aspirational scripts themselves. > > The goal is trying to avoid the "we", and also trying to get clarification as to > what they're "excluded" from. Overall, this comment is "Not all characters in > the script are included because some are excluded", but that's missing a few > subjects (included in what, excluded from what) (FWIW I don't mind "we" as avoiding it sometimes results in passive voice or confused constructions, but I do agree in this case that I don't know what the comment is trying to say, probably because I don't know what it means for a "character to be excluded".) https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be confused with U+30FB (Katakana Middle Dot). On 2016/03/16 20:30:48, Ryan Sleevi wrote: > s/font and/font, and/ Hmm, I don't think that's actually better. (It would be if "because" was replaced with a semicolon, though.)
Thank you for the review again, Peter and Ryan. I updated CL to address your comments. PTAL. https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro... File components/omnibox/browser/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro... components/omnibox/browser/history_url_provider_unittest.cc:856: // A URL that matches due to a match in the punycode URL are allowed to be the On 2016/03/17 06:01:24, Peter Kasting wrote: > Nit: are -> is Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:264: LAZY_INSTANCE_INITIALIZER; On 2016/03/16 20:30:49, Ryan Sleevi wrote: > DESIGN: Do we run the risk of lock contention here? If so, we could always > consider moving g_dangerous_pattern (which I understand can only check one > pattern at a time) into a thread-local storage bucket. That makes a tradeoff of > memory vs CPU, but it seems like it would amortize as a win, since there will > only be a few threads that invoke this routine (ideally, only the IO thread, but > ChromeCast & chromoting et all make that weird) Agreed and switched to TLS. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:311: // transitional processing treat them as IDNA 2003 does; maps U+00DF and On 2016/03/16 20:30:49, Ryan Sleevi wrote: > s/treat/treats/, since transitional processing is singular Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:337: // If uspoof_check fails or any of check is flagged, treat any IDN as On 2016/03/16 20:30:49, Ryan Sleevi wrote: > comment nit: "any of check is flagged" doesn't read right. Is this meant to be > "if any checks are flagged" ? That might also read weird, so how does > > // If uspoof_check fails (due to library failure), or if any of the checks fail, > treat the IDN as unsafe. > > Is that the same comment? Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown in 'fu<sharp-s>' while 'fu<sharp-s>' typed On 2016/03/16 20:30:49, Ryan Sleevi wrote: > s/shown in/shown as/ ? Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:350: // or copy and pasted as Unicode would be canonicalized to 'fuss'. This On 2016/03/16 20:30:48, Ryan Sleevi wrote: > I have trouble understanding why this is (that is, canonicalized to 'fuss'), and > am trying to understand if this is documenting the behaviour of the layers below > it (//net and GURL) or the consumption at the layers above it (omnibox). Could > you explain a little more via CL comment, and then we can see if any changes to > the code comment are possible/worth it? A hostname typed(or copy-n-pasted) by a user or a hostname found in a web page is canonicalized by GURL before being stored for later use. a. A punycode hostname with a deviation character is stored as it is in GURL (because it's in ASCII) b. for display purpose, it is converted to Unicode to 'fu<sharp-s>'. So far it's all right. When the url in the omnibox (with <sharp-s>) is copied, it's canonicalized by GURL (again) to 'ss' before being copied. This makes it a user confused. See http://crbug.com/595263 https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:418: // http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts and On 2016/03/16 20:30:49, Ryan Sleevi wrote: > the "and" appears to be an incomplete comment? Acknowledged. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:420: // characters are excluded. Instead, use characters listed with Status/Type On 2016/03/16 20:30:49, Ryan Sleevi wrote: > comment nit: Because some characters of the aspirational scripts are excluded > [from what?], it's not possible to add the aspirational scripts themselves. > > The goal is trying to avoid the "we", and also trying to get clarification as to > what they're "excluded" from. Overall, this comment is "Not all characters in > the script are included because some are excluded", but that's missing a few > subjects (included in what, excluded from what) Rewrote the paragraph. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:420: // characters are excluded. Instead, use characters listed with Status/Type On 2016/03/17 06:01:24, Peter Kasting wrote: > On 2016/03/16 20:30:49, Ryan Sleevi wrote: > > comment nit: Because some characters of the aspirational scripts are excluded > > [from what?], it's not possible to add the aspirational scripts themselves. > > > > The goal is trying to avoid the "we", and also trying to get clarification as > to > > what they're "excluded" from. Overall, this comment is "Not all characters in > > the script are included because some are excluded", but that's missing a few > > subjects (included in what, excluded from what) > > (FWIW I don't mind "we" as avoiding it sometimes results in passive voice or > confused constructions, but I do agree in this case that I don't know what the > comment is trying to say, probably because I don't know what it means for a > "character to be excluded".) Yeah, a lot of passive voice sentences were added :-). Anyway, I rewrote the paragraph. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:424: // current version is 8.0.0. On 2016/03/16 20:30:48, Ryan Sleevi wrote: > Is there a way to guard this with a compile-time flag, such as indicating what > version of ICU is being used? In this way, any time an ICU uprev happens, the > person performing the uprev re-evaluates this code and updates the compile-time > guard as appropriate. I'm adding a compile time guard based on ICU major version number. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:444: // U+0338 is included in the recommended set while U+05F4 and U+2027 are in On 2016/03/16 20:30:49, Ryan Sleevi wrote: > s/set while/set, while/ Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be confused with U+30FB (Katakana Middle Dot). On 2016/03/16 20:30:48, Ryan Sleevi wrote: > s/font and/font, and/ Done. https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt... components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be confused with U+30FB (Katakana Middle Dot). On 2016/03/17 06:01:24, Peter Kasting wrote: > On 2016/03/16 20:30:48, Ryan Sleevi wrote: > > s/font and/font, and/ > > Hmm, I don't think that's actually better. (It would be if "because" was > replaced with a semicolon, though.) Replaced 'because' with a semicolon :-)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/560001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/560001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
LGTM Thanks for taking this on! Exciting times!
Thanks a lot, Ryan, for your patience and careful review ! Brett, can you review/approve url/* (two more test cases added for url_canonicalize)?
url_canon LGTM
The CQ bit was checked by jshin@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1258813002/#ps560001 (title: "more comment update per Peter")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/560001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/560001
Message was sent while issue was closed.
Committed patchset #29 (id:560001)
Message was sent while issue was closed.
Description was changed from ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to "fix" bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* ========== to ========== Implement a new IDN display policy The new policy is language-indepedent, implemented with ICU's uspoof API and is as following: 1. Use moderately restrictive rules for script mixing [1] with additional restrictions on mixing with Latin. - Script mixing is only allowed with ASCII-Latin (instead of any Latin) + another script allowed at the moderatate restriction level 2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4]. 3. Allow 5 aspirational scripts from UAX 31 [5] 4. Do not allow labels with two or more numbering systems mixed. 5. Do not allow invisible characters or a sequence of the same combining mark. 6. Turn off whole script confusable check. It'd block some common domain labels like рф (IDN ccTLD for '.ru'), 'bücher' (German) and 'färgbolaget' (Swedish). 7. Keep ON 'mixed script confusable' check. This is different/separate from 'script mixing restriction' and will catch cases like 'gօօgle' with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise allowed by rules #1 ~ #5. 8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be mistaken as a slash. (this has been in place for a few years and is kept.) 9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008) encoded in punycode/ACE are always shown in Punycode. This is to make the display policy consistent with our prior decision to use UTS 46 'transitional' processing (map or drop the 4 deviation characters.). [9] 10. Character black list (Mozilla's : [8]) is trimmed down to two characters. Note that this is almost identical to Mozilla's IDN display algorithm [7] except for #7, #8, and an additional restrictions in #1. #9 is another difference because of Mozilla's use of UTS 46 'non-transitional' processing and our use of UTS 46 'transitional' processing. Most of domains filtered out in ".com" TLD is filtered due to the character set restrictiction (#2 and #3) that accounts for 94% (2,050) of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD). All the IDN TLDs are shown in Unicode. So are all the IDNs in the effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k). 48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown in punycode. P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up this CL with another to get rid of that parameter and adjust callers. P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed. [1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection [2] http://www.unicode.org/reports/tr39 http://www.unicode.org/Public/security/latest/xidmodifications.txt [3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts [4] http://goo.gl/L3WD1s [5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts [6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None [7] https://wiki.mozilla.org/IDN_Display_Algorithm [8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are blocked or mapped any way by other restrictions/mechanism in place. See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108 [9] This is to "fix" bug 595263 BUG=336973,595263 TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*, --gtest_filter=*Puny* Committed: https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c Cr-Commit-Position: refs/heads/master@{#382029} ==========
Message was sent while issue was closed.
Patchset 29 (id:??) landed as https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c Cr-Commit-Position: refs/heads/master@{#382029} |