Issue 1258813002: Implement a new IDN display policy

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

5 years, 4 months ago (2015-07-28 23:40:52 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/60001

5 years, 4 months ago (2015-07-28 23:41:18 UTC) #2

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 4 months ago (2015-07-28 23:51:02 UTC) #3

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/35739) ios_dbg_simulator_ninja on ...

5 years, 4 months ago (2015-07-28 23:51:03 UTC) #4

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

5 years, 4 months ago (2015-07-29 00:04:55 UTC) #5

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/80001

5 years, 4 months ago (2015-07-29 00:05:45 UTC) #6

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 4 months ago (2015-07-29 00:16:13 UTC) #7

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_compile_dbg_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_compile_dbg_ng/builds/78983)

5 years, 4 months ago (2015-07-29 00:16:13 UTC) #8

Ryan Sleevi

rsleevi@chromium.org changed reviewers: + rsleevi@chromium.org

5 years, 3 months ago (2015-08-28 00:35:14 UTC) #9

Ryan Sleevi

I don't know why I had these drafts, but uh, have some drafts? :) https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.cc ...

5 years, 3 months ago (2015-08-28 00:35:15 UTC) #10

jungshik at Google

Thank you for the comments. I just happened to come back to this CL and ...

5 years, 3 months ago (2015-08-28 21:05:25 UTC) #11

jungshik at Google

Ryan, can you take another look? I'll also add a bunch of new IDNs to ...

5 years, 3 months ago (2015-09-01 19:47:07 UTC) #12

Ryan, can you take another look? 

I'll also add a bunch of new IDNs to the unittest.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu.cc
File net/base/net_util_icu.cc (right):

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:33: class IDNSpoofChecker {
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Document

Done.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:36: bool check(const base::char16* label, int
label_len);
On 2015/08/28 00:35:14, Ryan Sleevi wrote:
> naming nit: These should be called Check
> DANGER: using "int" for buffer length's is quite dangerous! Would a
> base::StringPiece16 work here (at least to control the danger). Alternatively,
> would size_t work with a checked cast?

Done.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:36: bool check(const base::char16* label, int
label_len);
On 2015/08/28 00:35:14, Ryan Sleevi wrote:
> naming nit: These should be called Check
> DANGER: using "int" for buffer length's is quite dangerous! Would a
> base::StringPiece16 work here (at least to control the danger). Alternatively,
> would size_t work with a checked cast?

I switched to StringPiece16 (I thought about it but didn't bother...).

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:36: bool check(const base::char16* label, int
label_len);
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Document

Done.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:46: class IDNSpoofCheckerExtra {
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Document

Done.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:49: bool check(const base::char16* label, int
label_len);
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Same comments as above

Done.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:63: DCHECK(U_SUCCESS(status)) << "spoof checker failed
to open with error: "
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Is DCHECK really appropriate? Does this indicate programmer error? Or fatal
> library failure?

This should never happen unless for some reason, ICU data is not available or
ICU data is broken/malformed. Unfortunately, there are mysterious crash reports
that could only happen with ICU data missing/unavailable/broken on Android / iOS
(e.g. internal bug : b/23186531 ). 

If the ICU data is missing/malformed/.., we're hosed anyway and we can just use
CHECK here. 

Alternatively, we can move along (disabling IDN spoof check and displaying all
the IDNs in punycode) hoping that the rest of ICU is somehow all right.

https://codereview.chromium.org/1258813002/diff/100001/net/base/net_util_icu....
net/base/net_util_icu.cc:137: DCHECK(U_SUCCESS(status));
On 2015/08/28 00:35:15, Ryan Sleevi wrote:
> Shouldn't you actually handle if this can fail?

I agree. Done.

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

5 years, 3 months ago (2015-09-02 18:38:23 UTC) #13

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/180001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/180001

5 years, 3 months ago (2015-09-02 18:38:43 UTC) #14

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 3 months ago (2015-09-02 19:19:44 UTC) #15

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/107339) win8_chromium_ng on ...

5 years, 3 months ago (2015-09-02 19:19:45 UTC) #16

Ryan Sleevi

Sorry for the delay. I've taken another pass at reviewing - while I don't know ...

5 years, 3 months ago (2015-09-08 21:45:59 UTC) #17

Sorry for the delay. I've taken another pass at reviewing - while I don't know
the ICU APIs, I'm starting to learn from reviewing this code :)

My high-level suggestion would be to consider documenting some of the policy and
trickiness in a markdown file (perhaps in this same directory), given the
Chromium Wiki shuttering. You could also do it on dev.chromium.org, although the
markdown of course has the benefit that it's in the source tree.

In particular, the discussion about "first pass" and "second pass" in the code
suggest some non-trivialness (if you just read the comments), and it doesn't
really make sense until you look at all of the code. Thus, it might help to
provide a brief high-level overview of the policies, and then it's clear how the
code implements them.

It might even just make sense as a file-level comment - a .md file may be too
heavyweight for how 'simple' it is - but that's the most significant bit of
feedback I'd offer from reviewing this.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:239: // with the level of script
mixing, allowed characters, and types of checks.
Comment nit: I found this comment a bit hard to read. I've tried rewriting it,
both to make sure I correctly understand and to offer feedback.

// A helper class for the first pass of IDN spoof checking.

I'm not clear what "level of script mixing, allowed characters, and types of
checks" is meant to convey. As I read it, it suggests parameters / inputs, but
perhaps you mean it as "The policies that the Chromium security and
internationalization teams have decided on", in which case, it might be best set

// A helper class for the first pass of IDN spoof checking, used to
// ensure that no IDN input meets Chromium's standard of spoofability.
// For a more thorough explanation of how spoof checking works in
// Chromium, see [link to Markdown file explaining?]

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:247: // check is necessary.
Comment nit: Suggested rewording

// Returns true if |label| is safe to display as Unicode. In the event
// of library failure, all IDN inputs will be treated as unsafe.


In this reword, I try to remove the documentation about the implementation ("is
done with |checker_|", "Besides, it internally calls"), since that should be
both obvious from the code and fragile to document. Instead, it just spells out
the API contract.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:261: // |USpoofChecker| in
IDNSpoofChecker.
// A helper class for the second pass of IDN spoof checking.
// For a more thorough explanation of how spoof checking works in Chromium,
// see [link to Markdown file explaining first and second passes]

It's unclear to me why this is a separate helper object, as opposed to just
being one spoof checker that does both passes, but I haven't read the rest of
the code yet, I'm just basing it on "read this file and try to understand from
header comments what will happen"

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:266: // in Unicode. Called by
IDNSpoofChecker::Check.
Comment nit: I would remove the second sentence - that's discussing
implementation details of some 'other' implementation (even if it is in the same
file)

// Returns true if |label| is safe to display as Unicode according to
// extended spoofability criteria. Because this extended criteria may
// be expensive, it should only be used after first ensuring that |label|
// is safe using IDNSpoofChecker::Check.


Of course, the problem with the above suggested reword is you see how tightly
coupled IDNSpoofChecker and IDNSpoofCheckerExtra are. If *Extra is an
implementation detail of IDNSpoofChecker, then SpoofChecker shouldn't be talking
about it. If it's the caller's responsibility to call both, then we should
document that. But if IDNSpoofChecker is calling IDNSpoofCheckerExtra and that's
the API contract, then the comments should be clearer to that point, such as

// Implementation detail of IDNSpoofChecker.
// As some checks for spoofability are expensive to compute,
IDNSpoofCheckerExtra
// implements the secondary checks after a given |label| has passed
// the initial spoof checking.
// This class should never be called directly - spoofability should
// be checked with IDNSpoofChecker - and just exists to encapsulate the
// more expensive logic.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:292: // The default is highly
restrictive so that it's not set explicitly.
I have trouble making sense of this sentence. It seems to suggest that the API
call is unnecessary (because it's the default), but then you're setting it, so
I'm not sure how to make sense of it.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:294: // using that, instead.
Is there a bug open for this review? Just some sort of tracking bug for Chromium
:)

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:301: // released and we update our
copy of ICU.
Comment nit: Avoid pronouns in comments (
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/NH-S6KCkr2M
)

// The recommended set and inclusion set come from ...
// This list may change over time, and will be updated whenever the
// included version of ICU is updated.

(This avoids ambiguity for things like Chromium linking against system ICU, or
non-Chromium distributions, etc - the "who is 'we'" question)

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:307:
allowed_set.addAll(*inclusion_set);
So you explain where the lists come from, but don't quite explain the policy
here (as in, why is it OK to add these to the allowed sets).

I suspect this is something that might be best in an .md or explainer.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:319: // U+1F600) ?
Document and explain :P

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:327: // Hyphenation Point
comment nit: "." at the end of Hyphenation Point (ending the sentence)

comment nit: This reads a bit weird, the "not yet excluded from |allowed_set|".
Up until this point, there's no exclusions from allowed_set, so it's kinda
grammatically weird.

A suggested reword might be something like

// The following three characters are included in (recommended? inclusion) set,
// but are blacklisted as part of Mozilla's IDN blacklist (link here).
// Explicitly remove them, which does nothing if they're removed from
// (recommended, inclusion?) set in a future ICU update.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:333: // them out as Mozilla does.
Bug #? :)

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:348: // confusable characters for all
letters in them.
comment nit: Grammatically, this is weird because of the double "because"

// Disable whole-script-confusable checks.
// While this check would be desirable, simple strings such as
// 'pax' (Latin) and 'b<u-umlaut>cher' (Latin?) are whole script confusable
// with other scripts (such as Cyrillic/Greek).

[Did the above capture the concerns?]

I'm also not sure the "check against a list of well known good domain names" -
is this talking about whitelisted TLDs? whitelisted strings? Something else?

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:386:
"[^\\p{Katakana}\\p{Hiragana}\\p{Han}]"
As written, doesn't this mean that a lone katakana no, so, zo, or n (without any
context) will be allowed, since it'll fail the regex before/after?

Or that

"\\u0061\\u30ce" would be treated as safe (I'm probably butchering things here)

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:389: // mark, but we block a sequence
of similary looking
same comments about "we"

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:390: // Japanese combining marks as
well.
So, overall comment wise, this is somewhat hard to read because the comments are
written for the positive case, but the actual regex is the negative (dangerous
patterns).

It may help to ensure the comments match the negativity.

// Disallow the katakana no, so, zo, or n, as they may be mistaken
// for slashes, unless they're entirely enclosed by Katakana, Hiragana,
// or Han scripts.

// Disallow repeating Japanese accent characters. Because
// USPOOF_INVISIBLE will only check for repeated occurrences of the
// same combining mark, it's necessary to block any sequence of
// similar looking Japanese combining marks as well.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:397: // This is called only if script
mixing is detected.
Same comment about documenting 'how it's used' and such. This comment seems
superfluous, but if you're trying to document a precondition, do it as such


// |label| is a string that contains multiple, mixed scripts.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:399: // ASCII-Latin instead  of any
Latin.
comment nit: s/  / /

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:411: // TODO(jshin): Check spoofing
attempt against a list of 'good' domains
BUG #? Is this domains (labels?) or TLDs?

jungshik at Google

On 2015/12/08 01:39:27, Ryan Sleevi wrote: > Ping? yeah... it's been in my todo list. ...

5 years ago (2015-12-10 01:05:58 UTC) #19

jungshik at Google

On 2015/12/10 01:05:58, jshin (jungshik at g) wrote: > On 2015/12/08 01:39:27, Ryan Sleevi wrote: ...

5 years ago (2015-12-10 01:07:07 UTC) #20

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-09 21:10:03 UTC) #21

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended and inclusion sets from UAX 39 [2]
3. Allow 5 aspirational scripts from UAX 31 [3]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block 'pax' because
   p,a, or x has a Cyrillic look-alike.  The same is true of
   'bücher' (German) or 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers. 


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=net_unittests --gtest_filter=*IDN, --gtest_filter=FormatUrl*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=FormatUrl*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-09 21:11:32 UTC) #22

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=FormatUrl*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-09 21:17:43 UTC) #23

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

4 years, 10 months ago (2016-02-14 03:17:40 UTC) #24

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/320001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/320001

4 years, 10 months ago (2016-02-14 03:17:50 UTC) #25

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 10 months ago (2016-02-14 04:23:23 UTC) #26

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 10 months ago (2016-02-14 04:23:23 UTC) #27

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-15 09:00:49 UTC) #28

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use Highly restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + Han + {Hangul, Bofomopo, Hiragana+Katagana}.
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display policy except for
#7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (Long solidus overlay). 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-17 18:16:57 UTC) #29

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display policy except for
#7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (Long solidus overlay). 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [5] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-17 18:17:15 UTC) #30

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [5] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 10 months ago (2016-02-17 18:19:53 UTC) #31

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with an added
   restriction on Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional 
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

Ryan Sleevi

I saw you were tweaking this CL - did you mean to publish for review?

4 years, 10 months ago (2016-02-25 22:52:02 UTC) #32

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-03 19:38:47 UTC) #33

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional 
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
   - Block about 40 non-Latin characters (that look similar to at least one of
characters in Latin-ASCII from mixing with Latin-ASCII)
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31 [3]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
8. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional 
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31
[3]. This is equivalent to [:IdentifierStatus=Allowed:]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate from
'script mixing restriction'. 
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-09 00:45:42 UTC) #34

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional 
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate restriction level'
2. Allow the recommended sets from UTS 39 [2] and inclusion sets from UAX 31
[3]. This is equivalent to [:IdentifierStatus=Allowed:]
3. Allow 5 aspirational scripts from UAX 31 [4]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate from
'script mixing restriction'. 
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Block repeating Japanese combining marks (U+3099 - U+309C) even
   when different ones come in sequence. If the same one comes in
   a row, they'll be blocked by rule #5 above.


Note that this is almost identical to Mozilla's IDN display algorithm [5] except
for #7, #8 and two additional restrictions in #1. Two other differences are that
there is NO TLD whitelist (Mozilla plans to get rid of the whitelist, but it is
still used atm.) and that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [6] 

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[5] https://wiki.mozilla.org/IDN_Display_Algorithm
[6] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them are
blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction'.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[6] except for #7, #8, #9 and an additional restrictions in #1. Two other
differences are that there is NO TLD whitelist (Mozilla plans to get
rid of the whitelist, but it is still used atm.) and that Mozilla's
character black list is not used except for one character (U+0338 Long
solidus overlay). [7]

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] https://wiki.mozilla.org/IDN_Display_Algorithm
[7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-09 01:31:37 UTC) #35

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction'.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[6] except for #7, #8, #9 and an additional restrictions in #1. Two other
differences are that there is NO TLD whitelist (Mozilla plans to get
rid of the whitelist, but it is still used atm.) and that Mozilla's
character black list is not used except for one character (U+0338 Long
solidus overlay). [7]

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] https://wiki.mozilla.org/IDN_Display_Algorithm
[7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction'.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[6] except for #7, #8, #9 and an additional restrictions in #1. Two other
differences are that there is NO TLD whitelist (Mozilla plans to get
rid of the whitelist, but it is still used atm.) and that Mozilla's
character black list is not used except for one character (U+0338 Long
solidus overlay). [7]

xyz, 3 out of 25k in jp and none in ru and みんな TLDs.

Most of domains filtered out in com TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of  (0.2% of ~ 1 million).

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] https://wiki.mozilla.org/IDN_Display_Algorithm
[7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-09 20:38:45 UTC) #36

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction'.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[6] except for #7, #8, #9 and an additional restrictions in #1. Two other
differences are that there is NO TLD whitelist (Mozilla plans to get
rid of the whitelist, but it is still used atm.) and that Mozilla's
character black list is not used except for one character (U+0338 Long
solidus overlay). [7]

xyz, 3 out of 25k in jp and none in ru and みんな TLDs.

Most of domains filtered out in com TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of  (0.2% of ~ 1 million).

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] https://wiki.mozilla.org/IDN_Display_Algorithm
[7] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, #9 and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8]

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-09 20:51:13 UTC) #37

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, #9 and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8]

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8] #9 is yet another difference
because of Mozilla's use of UTS 46 'non-transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

jshin@chromium.org changed reviewers: + brettw@chromium.org, pkasting@google.com

4 years, 9 months ago (2016-03-11 20:05:35 UTC) #38

jungshik at Google

Ryan and Peter, Can you take a look at url_formatter changes? Brett, I added one ...

4 years, 9 months ago (2016-03-11 20:05:38 UTC) #39

Ryan and Peter, 

Can you take a look at url_formatter changes?

Brett, I added one more test to url_canon. Can you take a look? 

To Ryan, below is my long over-due reply to your review comments. Some of them
became obsolete because I've changed the CL a lot since. Sorry for the terribly
late reply (I drafted but never published) and thanks a lot for your very
careful review.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:239: // with the level of script
mixing, allowed characters, and types of checks.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> Comment nit: I found this comment a bit hard to read. I've tried rewriting it,
> both to make sure I correctly understand and to offer feedback.
> 
> // A helper class for the first pass of IDN spoof checking.
> 
> I'm not clear what "level of script mixing, allowed characters, and types of
> checks" is meant to convey. As I read it, it suggests parameters / inputs, but
> perhaps you mean it as "The policies that the Chromium security and
> internationalization teams have decided on", in which case, it might be best
set
> 
> // A helper class for the first pass of IDN spoof checking, used to
> // ensure that no IDN input meets Chromium's standard of spoofability.
> // For a more thorough explanation of how spoof checking works in
> // Chromium, see [link to Markdown file explaining?]

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:247: // check is necessary.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> Comment nit: Suggested rewording
> 
> // Returns true if |label| is safe to display as Unicode. In the event
> // of library failure, all IDN inputs will be treated as unsafe.
> 
> 
> In this reword, I try to remove the documentation about the implementation
("is
> done with |checker_|", "Besides, it internally calls"), since that should be
> both obvious from the code and fragile to document. Instead, it just spells
out
> the API contract.

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:261: // |USpoofChecker| in
IDNSpoofChecker.
On 2015/09/08 21:45:59, Ryan Sleevi wrote:
> // A helper class for the second pass of IDN spoof checking.
> // For a more thorough explanation of how spoof checking works in Chromium,
> // see [link to Markdown file explaining first and second passes]
> 
> It's unclear to me why this is a separate helper object, as opposed to just
> being one spoof checker that does both passes, but I haven't read the rest of
> the code yet, I'm just basing it on "read this file and try to understand from
> header comments what will happen"

I removed this extra class and fold that into the first class.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:266: // in Unicode. Called by
IDNSpoofChecker::Check.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> Comment nit: I would remove the second sentence - that's discussing
> implementation details of some 'other' implementation (even if it is in the
same
> file)
> 
> // Returns true if |label| is safe to display as Unicode according to
> // extended spoofability criteria. Because this extended criteria may
> // be expensive, it should only be used after first ensuring that |label|
> // is safe using IDNSpoofChecker::Check.
> 
> 
> Of course, the problem with the above suggested reword is you see how tightly
> coupled IDNSpoofChecker and IDNSpoofCheckerExtra are. If *Extra is an
> implementation detail of IDNSpoofChecker, then SpoofChecker shouldn't be
talking
> about it. If it's the caller's responsibility to call both, then we should
> document that. But if IDNSpoofChecker is calling IDNSpoofCheckerExtra and
that's
> the API contract, then the comments should be clearer to that point, such as
> 
> // Implementation detail of IDNSpoofChecker.
> // As some checks for spoofability are expensive to compute,
> IDNSpoofCheckerExtra
> // implements the secondary checks after a given |label| has passed
> // the initial spoof checking.
> // This class should never be called directly - spoofability should
> // be checked with IDNSpoofChecker - and just exists to encapsulate the
> // more expensive logic.

Initially, I expected ExtraCheck to have more expensive logic (such as checking
against known good domains, etc) than what id has now, but I didn't do that (at
least in this CL) so that folding ExtraCheck into IDNSpoofChecker is simpler.
There's a bit more initialization to do for extra check, but it's not expensive.
When we have more expensive logic, we can separate that out.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:292: // The default is highly
restrictive so that it's not set explicitly.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> I have trouble making sense of this sentence. It seems to suggest that the API
> call is unnecessary (because it's the default), but then you're setting it, so
> I'm not sure how to make sense of it.

Sorry for the confusion. (yeah, it's confusing.). 
Anyway, I switched to moderately restrictive and the comment was rewritten
accordingly.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:294: // using that, instead.
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> Is there a bug open for this review? Just some sort of tracking bug for
Chromium
> :)

By switching to moderately restrictive, no more need for TODO.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:301: // released and we update our
copy of ICU.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> Comment nit: Avoid pronouns in comments (
>
https://groups.google.com/a/chromium.org/forum/#!topic/chromium-dev/NH-S6KCkr2M
> )
> 
> // The recommended set and inclusion set come from ...
> // This list may change over time, and will be updated whenever the
> // included version of ICU is updated.
> 
> (This avoids ambiguity for things like Chromium linking against system ICU,
or
> non-Chromium distributions, etc - the "who is 'we'" question)

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:307:
allowed_set.addAll(*inclusion_set);
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> So you explain where the lists come from, but don't quite explain the policy
> here (as in, why is it OK to add these to the allowed sets).
> 
> I suspect this is something that might be best in an .md or explainer.

Made the comment more self-explanatory. A separate document will have more
details.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:319: // U+1F600) ?
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> Document and explain :P

Yeah... they're rather arbitrary (they come from ...) . I just nixed them for
now. I need to give more though on what to do with Emoji and other symbols.
(it'd be great if we can just pass them all (most of them) and have an entirely
different layer of defense against spoofing such as safe-browsing that can do a
lot more off-line processing with a lot more data).

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:327: // Hyphenation Point
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> comment nit: "." at the end of Hyphenation Point (ending the sentence)
> 
> comment nit: This reads a bit weird, the "not yet excluded from
|allowed_set|".
> Up until this point, there's no exclusions from allowed_set, so it's kinda
> grammatically weird.
> 
> A suggested reword might be something like
> 
> // The following three characters are included in (recommended? inclusion)
set,
> // but are blacklisted as part of Mozilla's IDN blacklist (link here).
> // Explicitly remove them, which does nothing if they're removed from
> // (recommended, inclusion?) set in a future ICU update.

I reviewed those 3 characters again and they can be safely allowed except for
U+0338 which is a bit problematic when rendered with a broken font/text renering
engine because it can be mistaken for a forward slash. (with the correct
rendering, it should not). I'm dropping only that character.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:333: // them out as Mozilla does.
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> Bug #? :)

Not necessary. They'd better be kept out. (per both uts46 and idna2008)  Removed
the comment.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:348: // confusable characters for all
letters in them.
On 2015/09/08 21:45:59, Ryan Sleevi (Slow to 2-21) wrote:
> comment nit: Grammatically, this is weird because of the double "because"
> 
> // Disable whole-script-confusable checks.
> // While this check would be desirable, simple strings such as
> // 'pax' (Latin) and 'b<u-umlaut>cher' (Latin?) are whole script confusable
> // with other scripts (such as Cyrillic/Greek).
> 
> [Did the above capture the concerns?]
> 
> I'm also not sure the "check against a list of well known good domain names" -
> is this talking about whitelisted TLDs? whitelisted strings? Something else?

Rewrote the comment with more details on a possible alternative.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:386:
"[^\\p{Katakana}\\p{Hiragana}\\p{Han}]"
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> As written, doesn't this mean that a lone katakana no, so, zo, or n (without
any
> context) will be allowed, since it'll fail the regex before/after?
> 
> Or that
> 
> "\\u0061\\u30ce" would be treated as safe (I'm probably butchering things
here)

Thank you for catching it. Rewrote the regex.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:389: // mark, but we block a sequence
of similary looking
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> same comments about "we"

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:390: // Japanese combining marks as
well.
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> So, overall comment wise, this is somewhat hard to read because the comments
are
> written for the positive case, but the actual regex is the negative (dangerous
> patterns).
> 
> It may help to ensure the comments match the negativity.
> 
> // Disallow the katakana no, so, zo, or n, as they may be mistaken
> // for slashes, unless they're entirely enclosed by Katakana, Hiragana,
> // or Han scripts.
> 
> // Disallow repeating Japanese accent characters. Because
> // USPOOF_INVISIBLE will only check for repeated occurrences of the
> // same combining mark, it's necessary to block any sequence of
> // similar looking Japanese combining marks as well.

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:397: // This is called only if script
mixing is detected.
On 2015/09/08 21:45:59, Ryan Sleevi wrote:
> Same comment about documenting 'how it's used' and such. This comment seems
> superfluous, but if you're trying to document a precondition, do it as such
> 
> 
> // |label| is a string that contains multiple, mixed scripts.

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:399: // ASCII-Latin instead  of any
Latin.
On 2015/09/08 21:45:58, Ryan Sleevi wrote:
> comment nit: s/  / /

Done.

https://codereview.chromium.org/1258813002/diff/180001/components/url_formatt...
components/url_formatter/url_formatter.cc:411: // TODO(jshin): Check spoofing
attempt against a list of 'good' domains
On 2015/09/08 21:45:58, Ryan Sleevi (Slow to 2-21) wrote:
> BUG #? Is this domains (labels?) or TLDs?

What I had in mind is now documented in another part of the code as a possible
alternative with more details.

jungshik at Google

On 2016/02/25 22:52:02, Ryan Sleevi wrote: > I saw you were tweaking this CL - ...

4 years, 9 months ago (2016-03-11 20:08:34 UTC) #40

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-11 20:11:21 UTC) #41

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash.
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8] #9 is yet another difference
because of Mozilla's use of UTS 46 'non-transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8] #9 is yet another difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-11 21:42:34 UTC) #42

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/460001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/460001

4 years, 9 months ago (2016-03-11 21:42:57 UTC) #43

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-11 22:04:07 UTC) #44

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/138939)

4 years, 9 months ago (2016-03-11 22:04:09 UTC) #45

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-11 23:54:30 UTC) #46

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/480001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/480001

4 years, 9 months ago (2016-03-11 23:54:58 UTC) #47

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-12 01:00:38 UTC) #48

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 9 months ago (2016-03-12 01:00:39 UTC) #49

Peter Kasting

pkasting@chromium.org changed reviewers: + pkasting@chromium.org

4 years, 9 months ago (2016-03-12 01:25:05 UTC) #50

Peter Kasting

I'm not competent to do a functional review, so this is basically all tiny comment ...

4 years, 9 months ago (2016-03-12 01:25:08 UTC) #51

Ryan Sleevi

I took a quick run at this before having to dash, but overall looks much ...

4 years, 9 months ago (2016-03-12 01:28:13 UTC) #52

jungshik at Google

Thanks a lot for the review, Peter and Ryan. I'll address them tomorrow. (I'm ooo ...

4 years, 9 months ago (2016-03-14 17:36:54 UTC) #53

Peter Kasting

On 2016/03/14 17:36:54, jshin (jungshik at google) wrote: > As for VLOG, actually I can ...

4 years, 9 months ago (2016-03-14 21:41:31 UTC) #54

jungshik at Google

Thanks a lot for the review. I addressed review comments. In addition, I reexamined 'blacklist ...

4 years, 9 months ago (2016-03-16 08:34:55 UTC) #55

Thanks a lot for the review. I addressed review comments. 

In addition, I reexamined 'blacklist chars' and dropped U+2027 (Hyphenation
Point) even though it does not appear in any of IDN domains I tested for several
TLDs (com, jp, xyz and a few more). I was mistaken that it'd be blocked by other
means, but it's not the case and it can be potentially risky because "<Katakan>
U+2027 <Katakan>
" can be mistaken for "<Katakana> U+30FB <Katakana>"

U+05F4 (that looks like a double quotation mark)  is kept for now because there
are checks blocking it when used with a non-Hebrew script. The only potentially
risky(?) case is when it's used by itself, but even that does not come across as
a big deal. 

BTW, I filed a bug at bugzilla.mozilla.org about these characters:
https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

As for other differences, #9 is not easy to resolve (it's not directly related
to this change.) . Mozilla does want to go all the way to IDNA 2008 (UTS 46
non-transitional processing) while Chrome (and IE) want to continue to use UTS
46 transitional processing.

https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro...
File components/omnibox/browser/history_url_provider_unittest.cc (left):

https://codereview.chromium.org/1258813002/diff/460001/components/omnibox/bro...
components/omnibox/browser/history_url_provider_unittest.cc:888:
arraysize(expected_true));
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Is it possible to replace these tests with something equivalent? 
Specifically,
> some sort of punycoded URL that will not be rendered as Unicode, so we can
> verify inline autocompletion is allowed to work there.

Yup. It's possible. I'm gonna use one of punycode URLs that always have to be
shown in Punycode.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:56: explicit HostComponentTransform()
{}
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: No explicit

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:194: // TODO(brettw) We may want to
skip this step in the case of file URLs to
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: ':' after (brettw)

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:242: // see
http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: Add trailing " ."  (Space before period ensures link is not corrupted)
> 
> Also applies in comments below that end with a URL.

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:246: // Returns true if |label| is
safe to display as Unicode. In the event
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: Blank line above this

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:247: // of library failure, all IDN
inputs will be treated as unsafe.
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: Wrap comments as close to 80 cols as possible (several places)

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:255: void
SetAllowedUnicodeSet(UErrorCode* status);
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: Member functions before member variables (and blank line between)

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:256:
DISALLOW_COPY_AND_ASSIGN(IDNSpoofChecker);
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: Blank line above this

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:272: << " ; all IDN will be shown in
punycode.";
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: In general prefer to avoid logging, unless you have a concrete plan for
> collecting and using this data

I added this LOG statement because in the past there was a mysterious failure to
open a related function for IDN handling and having an error name from
u_errorName was handy. The most likely cause was the ICU data not being loaded
on Windows/Android. Because that will be logged elsewhere in ICU loading code
(more was added recently by scottmg). I'll just get rid of LOG here.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:279: // except for USPOOF_CHAR_LIMIT.
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: I might put the "except for" phrase before the parenthesized list for
> clarity.

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:291: // Restrict allowed characters in
IDN labels and turn on USPOOF_CHAR_LIMIT.
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Wait, I thought we just said we didn't turn this on?

It's off by default and we're turning this on here. Let me reword a paragraph
starting with "By default, USpoofChecker is opened with ..." to clarify.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:296: checks |= USPOOF_AUX_INFO;
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: Simpler:
> 
>   int32_t checks = uspoof_getChecks(checker_, &status) | USPOOF_AUX_INFO;

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:304: // be done on the server-side
(e.g. SafeBrowsing).
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: . -> :

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:322:
UNICODE_STRING_SIMPLE("[:Latin:]"), status);
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: Prefer to linebreak after '=' (2 places)

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:325: // Latin letters outside the
ASCII. 'Script_Extensions=Latin' is
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: Remove "the"

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:337: int32_t results =
uspoof_check(checker_, label.data(),
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: |result| would be a little more typical.  Not sure if |check_result|
would
> be any clearer.

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:340: VLOG(5) <<
base::UTF16ToUTF8(label) <<  ":0x" << std::setfill('0')
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Why 5 and not the more common 1?  (several places)

Not any particular reason. I just dropped all the logs.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:357: // shown in 'fu<sharp-s>' while
'fu<sharp-s>' typed or copy'n'pasted in
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Nit: in -> as; copy'n'pasted -> copy-and-pasted

Done.

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.cc:464: // and U+2027 (Hyphenation Point)
have little risk and are allowed.
On 2016/03/12 01:25:07, Peter Kasting wrote:
> Do we have a bug on file with Mozilla to allow these, in order to align
> behavior?

I'll file tonight. (I started but I have to run).

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
File components/url_formatter/url_formatter.h (right):

https://codereview.chromium.org/1258813002/diff/460001/components/url_formatt...
components/url_formatter/url_formatter.h:58: // used any more and will be
removed. |format_type| is a bitmask
On 2016/03/12 01:25:08, Peter Kasting wrote:
> I assume you're doing these sorts of changes in a followup to avoid bloating
> this change with a bunch of mechanical changes?

Yes, that's right. (I wrote that to that effect in the CL description :-)).

https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest.cc
File url/url_canon_unittest.cc (right):

https://codereview.chromium.org/1258813002/diff/460001/url/url_canon_unittest...
url/url_canon_unittest.cc:405: // Maps U+1D68C(Math Monospace Small C) to 'c'.
On 2016/03/12 01:25:08, Peter Kasting wrote:
> Nit: Space before '('

Done.

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
components/url_formatter/url_formatter.cc:246: // Returns true if |label| is
safe to display as Unicode. In the event
On 2016/03/12 01:28:13, Ryan Sleevi wrote:
> nit: a newline between 245 and 246 (maybe it's easier in an editor with syntax
> coloring, but was visually hard to distinguish here)

Done.

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
components/url_formatter/url_formatter.cc:255: void
SetAllowedUnicodeSet(UErrorCode* status);
On 2016/03/12 01:28:13, Ryan Sleevi wrote:
> STYLE: This method should appear between lines 250 and 251 (per
> https://google.github.io/styleguide/cppguide.html#Declaration_Order ) - that
is,
> methods before members.

Done.

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
components/url_formatter/url_formatter.cc:342: VLOG_IF(5, results &
USPOOF_INVISIBLE) << base::UTF16ToUTF8(label)
On 2016/03/12 01:28:13, Ryan Sleevi wrote:
> nit: Is it necessary to convert label here the second time? Given that both
are
> VLOG(5)'d, the preceeding line would have had the label anyways. Also,
wouldn't
> it have had results (line 341) - so is it necessary to log this invisibility
> check failed message?
> 
> I know it's ultimately pedantic, but since it's VLOG, it'll be in release
> builds, but it seems unnecessary given 340/341.

Yup. I deleted it. I used it as a quicker way to spot it during the development,
but don't need it any more.

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
components/url_formatter/url_formatter.cc:365: << " deviation character in
punycode";
On 2016/03/12 01:28:13, Ryan Sleevi wrote:
> necessary?

Deleted it. Not necessary any more. Moreover, I can quickly find those 4
characters in my test set (collected from multiple TLDs even without running
this test).

https://codereview.chromium.org/1258813002/diff/480001/components/url_formatt...
components/url_formatter/url_formatter.cc:560: << info.errors;
On 2016/03/12 01:28:13, Ryan Sleevi wrote:
> Is this needed?

Deleting it because it's only for my debugging during the development.

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-16 08:38:33 UTC) #56

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. Another
difference is that Mozilla's character black list is not used except for
one character (U+0338 Long solidus overlay). [8] #9 is yet another difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-16 08:40:19 UTC) #57

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/520001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/520001

4 years, 9 months ago (2016-03-16 08:40:24 UTC) #58

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-16 09:12:26 UTC) #59

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.).
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This should fix
https://bugs.chromium.org/p/chromium/issues/detail?id=282677#c6


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-16 09:26:11 UTC) #60

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This should fix
https://bugs.chromium.org/p/chromium/issues/detail?id=282677#c6


BUG=336973
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This is to fix bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-16 10:26:23 UTC) #61

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 9 months ago (2016-03-16 10:26:24 UTC) #62

jungshik at Google

Filed a mozilla bug on making two browsers aligned as closely as possible: https://bugzilla.mozilla.org/show_bug.cgi?id=1257275

4 years, 9 months ago (2016-03-16 18:34:03 UTC) #63

jungshik at Google

On 2016/03/16 18:34:03, jshin (jungshik at google) wrote: > Filed a mozilla bug on making ...

4 years, 9 months ago (2016-03-16 18:35:28 UTC) #64

Ryan Sleevi

Like Peter, I struggle with the functional review, but I'm trying to do my best ...

4 years, 9 months ago (2016-03-16 20:30:49 UTC) #65

Like Peter, I struggle with the functional review, but I'm trying to do my best
to review the discussions on IDNA as best as possible.

While I have more comment nits, I do have one question regarding
design/performance and understanding the tradeoffs. Overall, this LG to me, and
matches as best I can tell the available documentation, and I think you've done
an admirable job in trying to build consensus. So I think with these few last
things addressed, I'm happy to give the review.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:264: LAZY_INSTANCE_INITIALIZER;
DESIGN: Do we run the risk of lock contention here? If so, we could always
consider moving g_dangerous_pattern (which I understand can only check one
pattern at a time) into a thread-local storage bucket. That makes a tradeoff of
memory vs CPU, but it seems like it would amortize as a win, since there will
only be a few threads that invoke this routine (ideally, only the IO thread, but
ChromeCast & chromoting et all make that weird)

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:311: // transitional processing treat
them as IDNA 2003 does; maps U+00DF and
s/treat/treats/, since transitional processing is singular

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:337: // If uspoof_check fails or any
of check is flagged, treat any IDN as
comment nit: "any of check is flagged" doesn't read right. Is this meant to be
"if any checks are flagged" ? That might also read weird, so how does

// If uspoof_check fails (due to library failure), or if any of the checks fail,
treat the IDN as unsafe.

Is that the same comment?

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown
in 'fu<sharp-s>' while 'fu<sharp-s>' typed
s/shown in/shown as/ ?

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:350: // or copy and pasted as Unicode
would be canonicalized to 'fuss'. This
I have trouble understanding why this is (that is, canonicalized to 'fuss'), and
am trying to understand if this is documenting the behaviour of the layers below
it (//net and GURL) or the consumption at the layers above it (omnibox). Could
you explain a little more via CL comment, and then we can see if any changes to
the code comment are possible/worth it?

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:418: //
http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts and
the "and" appears to be an incomplete comment?

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:420: // characters are excluded.
Instead, use characters listed with Status/Type
comment nit: Because some characters of the aspirational scripts are excluded
[from what?], it's not possible to add the aspirational scripts themselves.

The goal is trying to avoid the "we", and also trying to get clarification as to
what they're "excluded" from. Overall, this comment is "Not all characters in
the script are included because some are excluded", but that's missing a few
subjects (included in what, excluded from what)

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:424: // current version is 8.0.0.
Is there a way to guard this with a compile-time flag, such as indicating what
version of ICU is being used? In this way, any time an ICU uprev happens, the
person performing the uprev re-evaluates this code and updates the compile-time
guard as appropriate.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:444: // U+0338 is included in the
recommended set while U+05F4 and U+2027 are in
s/set while/set, while/

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be
confused with U+30FB (Katakana Middle Dot).
s/font and/font, and/

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-16 23:28:03 UTC) #66

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters. 

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing. 

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This is to fix bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This is to fix bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

jungshik at Google

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-17 00:29:02 UTC) #67

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108

[9] This is to fix bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix" bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

Peter Kasting

LGTM https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/browser/history_url_provider_unittest.cc File components/omnibox/browser/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/browser/history_url_provider_unittest.cc#newcode856 components/omnibox/browser/history_url_provider_unittest.cc:856: // A URL that matches due to a ...

4 years, 9 months ago (2016-03-17 06:01:24 UTC) #68

LGTM

https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro...
File components/omnibox/browser/history_url_provider_unittest.cc (right):

https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro...
components/omnibox/browser/history_url_provider_unittest.cc:856: // A URL that
matches due to a match in the punycode URL are allowed to be the
Nit: are -> is

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown
in 'fu<sharp-s>' while 'fu<sharp-s>' typed
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> s/shown in/shown as/ ?

Yes, that was a comment I made last time that didn't actually get addressed.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:420: // characters are excluded.
Instead, use characters listed with Status/Type
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> comment nit: Because some characters of the aspirational scripts are excluded
> [from what?], it's not possible to add the aspirational scripts themselves.
> 
> The goal is trying to avoid the "we", and also trying to get clarification as
to
> what they're "excluded" from. Overall, this comment is "Not all characters in
> the script are included because some are excluded", but that's missing a few
> subjects (included in what, excluded from what)

(FWIW I don't mind "we" as avoiding it sometimes results in passive voice or
confused constructions, but I do agree in this case that I don't know what the
comment is trying to say, probably because I don't know what it means for a
"character to be excluded".)

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be
confused with U+30FB (Katakana Middle Dot).
On 2016/03/16 20:30:48, Ryan Sleevi wrote:
> s/font and/font, and/

Hmm, I don't think that's actually better.  (It would be if "because" was
replaced with a semicolon, though.)

jungshik at Google

Thank you for the review again, Peter and Ryan. I updated CL to address your ...

4 years, 9 months ago (2016-03-17 07:43:26 UTC) #69

Thank you for the review  again, Peter and Ryan. 

I updated CL to address your comments. PTAL.

https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro...
File components/omnibox/browser/history_url_provider_unittest.cc (right):

https://codereview.chromium.org/1258813002/diff/520001/components/omnibox/bro...
components/omnibox/browser/history_url_provider_unittest.cc:856: // A URL that
matches due to a match in the punycode URL are allowed to be the
On 2016/03/17 06:01:24, Peter Kasting wrote:
> Nit: are -> is

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:264: LAZY_INSTANCE_INITIALIZER;
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> DESIGN: Do we run the risk of lock contention here? If so, we could always
> consider moving g_dangerous_pattern (which I understand can only check one
> pattern at a time) into a thread-local storage bucket. That makes a tradeoff
of
> memory vs CPU, but it seems like it would amortize as a win, since there will
> only be a few threads that invoke this routine (ideally, only the IO thread,
but
> ChromeCast & chromoting et all make that weird)

Agreed and switched to TLS.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:311: // transitional processing treat
them as IDNA 2003 does; maps U+00DF and
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> s/treat/treats/, since transitional processing is singular

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:337: // If uspoof_check fails or any
of check is flagged, treat any IDN as
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> comment nit: "any of check is flagged" doesn't read right. Is this meant to be
> "if any checks are flagged" ? That might also read weird, so how does
> 
> // If uspoof_check fails (due to library failure), or if any of the checks
fail,
> treat the IDN as unsafe.
> 
> Is that the same comment?

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:349: // 'fu<sharp-s>' would be shown
in 'fu<sharp-s>' while 'fu<sharp-s>' typed
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> s/shown in/shown as/ ?

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:350: // or copy and pasted as Unicode
would be canonicalized to 'fuss'. This
On 2016/03/16 20:30:48, Ryan Sleevi wrote:
> I have trouble understanding why this is (that is, canonicalized to 'fuss'),
and
> am trying to understand if this is documenting the behaviour of the layers
below
> it (//net and GURL) or the consumption at the layers above it (omnibox). Could
> you explain a little more via CL comment, and then we can see if any changes
to
> the code comment are possible/worth it?


A hostname typed(or copy-n-pasted) by a user or a hostname found in a web page
is canonicalized by GURL before being stored for later use. 

a. A punycode hostname with a deviation character is stored as it is in GURL
(because it's in ASCII)
b. for display purpose, it is converted to Unicode to 'fu<sharp-s>'. So far it's
all right. 

When the url in the omnibox (with <sharp-s>) is copied, it's canonicalized by
GURL (again) to 'ss' before being copied. This makes it a user confused. See
http://crbug.com/595263

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:418: //
http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts and
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> the "and" appears to be an incomplete comment?

Acknowledged.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:420: // characters are excluded.
Instead, use characters listed with Status/Type
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> comment nit: Because some characters of the aspirational scripts are excluded
> [from what?], it's not possible to add the aspirational scripts themselves.
> 
> The goal is trying to avoid the "we", and also trying to get clarification as
to
> what they're "excluded" from. Overall, this comment is "Not all characters in
> the script are included because some are excluded", but that's missing a few
> subjects (included in what, excluded from what)

Rewrote the paragraph.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:420: // characters are excluded.
Instead, use characters listed with Status/Type
On 2016/03/17 06:01:24, Peter Kasting wrote:
> On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> > comment nit: Because some characters of the aspirational scripts are
excluded
> > [from what?], it's not possible to add the aspirational scripts themselves.
> > 
> > The goal is trying to avoid the "we", and also trying to get clarification
as
> to
> > what they're "excluded" from. Overall, this comment is "Not all characters
in
> > the script are included because some are excluded", but that's missing a few
> > subjects (included in what, excluded from what)
> 
> (FWIW I don't mind "we" as avoiding it sometimes results in passive voice or
> confused constructions, but I do agree in this case that I don't know what the
> comment is trying to say, probably because I don't know what it means for a
> "character to be excluded".)

Yeah, a lot of passive voice sentences were added :-). Anyway, I rewrote the
paragraph.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:424: // current version is 8.0.0.
On 2016/03/16 20:30:48, Ryan Sleevi wrote:
> Is there a way to guard this with a compile-time flag, such as indicating what
> version of ICU is being used? In this way, any time an ICU uprev happens, the
> person performing the uprev re-evaluates this code and updates the
compile-time
> guard as appropriate.

I'm adding a compile time guard based on ICU major version number.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:444: // U+0338 is included in the
recommended set while U+05F4 and U+2027 are in
On 2016/03/16 20:30:49, Ryan Sleevi wrote:
> s/set while/set, while/

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be
confused with U+30FB (Katakana Middle Dot).
On 2016/03/16 20:30:48, Ryan Sleevi wrote:
> s/font and/font, and/

Done.

https://codereview.chromium.org/1258813002/diff/520001/components/url_formatt...
components/url_formatter/url_formatter.cc:448: // broken font and U+2027 can be
confused with U+30FB (Katakana Middle Dot).
On 2016/03/17 06:01:24, Peter Kasting wrote:
> On 2016/03/16 20:30:48, Ryan Sleevi wrote:
> > s/font and/font, and/
> 
> Hmm, I don't think that's actually better.  (It would be if "because" was
> replaced with a semicolon, though.)

Replaced 'because' with a semicolon :-)

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-17 07:47:36 UTC) #70

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/560001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/560001

4 years, 9 months ago (2016-03-17 07:47:53 UTC) #71

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-17 08:45:42 UTC) #72

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 9 months ago (2016-03-17 08:45:44 UTC) #73

jungshik at Google

Thanks a lot, Ryan, for your patience and careful review ! Brett, can you review/approve ...

4 years, 9 months ago (2016-03-18 05:41:34 UTC) #75

jungshik at Google

The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the ...

4 years, 9 months ago (2016-03-18 17:24:25 UTC) #78

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1258813002/560001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1258813002/560001

4 years, 9 months ago (2016-03-18 17:24:40 UTC) #79

commit-bot: I haz the power

Description was changed from ========== Implement a new IDN display policy The new policy is ...

4 years, 9 months ago (2016-03-18 18:45:36 UTC) #81

Message was sent while issue was closed.

Description was changed from

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.


[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix" bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*
==========

to

==========
Implement a new IDN display policy

The new policy is language-indepedent, implemented with
ICU's uspoof API and is as following:

1. Use moderately restrictive rules for script mixing [1] with additional
   restrictions on mixing with Latin.
   - Script mixing is only allowed with ASCII-Latin (instead
     of any Latin) + another script allowed at the moderatate
     restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
   UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
   combining mark.
6. Turn off whole script confusable check. It'd block some common
   domain labels like рф (IDN ccTLD for '.ru'),
   'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep ON 'mixed script confusable' check. This is different/separate
   from 'script mixing restriction' and will catch cases like 'gօօgle'
   with 'օ' (U+0585; Armenian Small Letter OH) [6] that would be otherwise
   allowed by rules #1 ~ #5.
8. Block 4 Katakanas surrounded by non-Japanese scripts because they could be
   mistaken as a slash. (this has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
   encoded in punycode/ACE are always shown in Punycode. This is to make
   the display policy consistent with our prior decision to use UTS 46
   'transitional' processing (map or drop the 4 deviation characters.). [9]
10. Character black list (Mozilla's : [8]) is trimmed down to two characters.

Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and an additional restrictions in #1. #9 is  another
difference
because of Mozilla's use of UTS 46 'non-transitional' processing and our use of
UTS 46 'transitional' processing.

Most of domains filtered out in ".com" TLD is filtered due to the
character set restrictiction (#2 and #3) that accounts for 94% (2,050)
of IDNs filtered out (0.2% of ~ 1 million IDNs in com TLD).

All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).

48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.

P.S. This CL keeps 'languages' parameter for the public APIs. I'll follow up
this CL with another to get rid of that parameter and adjust callers.

P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome
will be updated after this CL is landed.

[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
    http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
    are blocked or mapped any way by other restrictions/mechanism in place.
    See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix" bug 595263

BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
     --gtest_filter=*Puny*

Committed: https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c
Cr-Commit-Position: refs/heads/master@{#382029}
==========

commit-bot: I haz the power

4 years, 9 months ago (2016-03-18 18:45:39 UTC) #82

Message was sent while issue was closed.

Patchset 29 (id:??) landed as
https://crrev.com/62a928390ba06db29576bbb32606696b3e16a66c
Cr-Commit-Position: refs/heads/master@{#382029}

Issue 1258813002: Implement a new IDN display policy (Closed)

Description

Patch Set 1 #

Patch Set 2 : use lazy instance, remove the old code #

Patch Set 3 : a little bit reorg #

Patch Set 4 : restore languages as dummy arg for now #

Patch Set 5 : restore languages in one more function #

Patch Set 6 : add back languages to one more, update comments #

Patch Set 7 : manual rebase after move to components #

Patch Set 8 : format url test fix #

Patch Set 9 : review comments addressed + alpha #

Patch Set 10 : typo fix #

Patch Set 11 : rebased #

Patch Set 12 : address review comments + fix some test (still WiP) #

Patch Set 13 : whole_script check experiment #

Patch Set 14 : back to disable wholescript conf. check + comment update #

Patch Set 15 : switch to moderately restrictive + turn off wholescirpt confusable check #

Patch Set 16 : fix typos, clean up the allowed set. #

Patch Set 17 : comment clean up #

Patch Set 18 : more tweaks + temporary log for testing #

Patch Set 19 : U+30FB/FC tweak #

Patch Set 20 : more tweaking for U+30FC/B handling #

Patch Set 21 : dangerous check - thread safety #

Patch Set 22 : deviation character check added #

Patch Set 23 : add new IDN tests and drop special casing of U+30FC/B #

Patch Set 24 : adjust two IDN test entries #

Patch Set 25 : url_canon test: wchar* needs a surrogate pair on *nix #

Patch Set 26 : review commetns addressed #

Patch Set 27 : drop U+2027 and add tests for U+2027 and U+05F4 #

Patch Set 28 : use TLS and update comments #

Patch Set 29 : more comment update per Peter #

Messages