Issue 2784933002: Mitigate spoofing attempt using Latin letters.

jungshik at Google

Description was changed from ========== Sketch of detecting potential spoofing domains within a script BUG= ...

3 years, 8 months ago (2017-04-13 20:52:10 UTC) #1

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-13 20:52:25 UTC) #2

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/20001

3 years, 8 months ago (2017-04-13 20:52:55 UTC) #3

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-13 20:57:55 UTC) #4

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: android_cronet on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_cronet/builds/119109) mac_chromium_compile_dbg_ng on ...

3 years, 8 months ago (2017-04-13 20:57:55 UTC) #5

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-13 21:13:04 UTC) #6

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/80001

3 years, 8 months ago (2017-04-13 21:13:59 UTC) #7

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-13 23:30:50 UTC) #8

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/120001

3 years, 8 months ago (2017-04-13 23:31:21 UTC) #9

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-13 23:41:32 UTC) #10

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-clang/builds/75987) ios-simulator-xcode-clang on ...

3 years, 8 months ago (2017-04-13 23:41:33 UTC) #11

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-14 16:59:41 UTC) #12

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/160001

3 years, 8 months ago (2017-04-14 16:59:52 UTC) #13

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-14 17:08:01 UTC) #14

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xcode-clang/builds/80196)

3 years, 8 months ago (2017-04-14 17:08:01 UTC) #15

jungshik at Google

Description was changed from ========== Sketch of detecting potential spoofing domains within a script BUG=703750 ...

3 years, 8 months ago (2017-04-17 21:39:01 UTC) #16

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-18 00:00:05 UTC) #17

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/230001

3 years, 8 months ago (2017-04-18 00:00:30 UTC) #18

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-18 00:09:07 UTC) #19

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: android_cronet on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_cronet/builds/121062) cast_shell_linux on ...

3 years, 8 months ago (2017-04-18 00:09:08 UTC) #20

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-18 00:20:41 UTC) #21

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/240001

3 years, 8 months ago (2017-04-18 00:21:03 UTC) #22

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-18 00:28:55 UTC) #23

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/195971) mac_chromium_compile_dbg_ng on ...

3 years, 8 months ago (2017-04-18 00:28:56 UTC) #24

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-18 09:39:46 UTC) #25

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/250001

3 years, 8 months ago (2017-04-18 09:39:56 UTC) #26

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-18 09:46:53 UTC) #27

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/349509) ios-device on ...

3 years, 8 months ago (2017-04-18 09:46:55 UTC) #28

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-18 10:02:13 UTC) #29

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/260001

3 years, 8 months ago (2017-04-18 10:02:25 UTC) #30

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-18 10:09:42 UTC) #31

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/193038)

3 years, 8 months ago (2017-04-18 10:09:43 UTC) #32

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-18 21:44:08 UTC) #33

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/310001

3 years, 8 months ago (2017-04-18 21:44:37 UTC) #34

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-18 22:56:45 UTC) #35

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-18 23:00:10 UTC) #36

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-19 23:14:00 UTC) #37

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/430001

3 years, 8 months ago (2017-04-19 23:14:35 UTC) #38

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-20 00:22:23 UTC) #39

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-20 00:22:25 UTC) #40

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-20 19:34:59 UTC) #41

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/490001

3 years, 8 months ago (2017-04-20 19:35:53 UTC) #42

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-20 19:39:13 UTC) #43

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-20 19:43:20 UTC) #44

Description was changed from

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top N domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top N domains.

Binary file size increase: ~88kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1400 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4473)

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~88kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-20 19:44:44 UTC) #45

Description was changed from

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~88kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~88kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

To use the DAFSA-related code and tools from components/, they are moved from
net/ to base/.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========
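
The second check in the description above can be sketched in Python as follows. This is a minimal illustration, not the patch's C++ implementation: it assumes that canonical decomposition (NFD) followed by dropping combining marks is an acceptable stand-in for the diacritic removal, and `TOP_DOMAINS`, `remove_diacritics`, and `looks_like_top_domain` are hypothetical names.

```python
import unicodedata

# The three extra mappings from the description. These letters have no
# canonical decomposition, so NFD alone leaves them untouched.
EXTRA_MAPPINGS = str.maketrans({"ł": "l", "ø": "o", "đ": "d"})

# Hypothetical stand-in for the pre-computed top-domain list.
TOP_DOMAINS = {"google.com", "microsoft.com"}


def remove_diacritics(hostname: str) -> str:
    """Decompose, drop characters with a non-zero combining class, then map."""
    decomposed = unicodedata.normalize("NFD", hostname)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_MAPPINGS)


def looks_like_top_domain(hostname: str) -> bool:
    """True if removing diacritics turns the hostname into a top domain."""
    stripped = remove_diacritics(hostname)
    return stripped != hostname and stripped in TOP_DOMAINS
```

A hostname that is itself in the list is not flagged by this sketch, because stripping diacritics leaves it unchanged.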

jungshik at Google

jshin@chromium.org changed reviewers: + rsleevi@chromium.org

3 years, 8 months ago (2017-04-20 19:57:21 UTC) #46

jungshik at Google

Ryan, can you take a look? A lot of files outside components/url_format are due to ...

3 years, 8 months ago (2017-04-20 19:57:23 UTC) #47

jungshik at Google

jshin@chromium.org changed reviewers: + pkasting@chromium.org

3 years, 8 months ago (2017-04-20 19:58:20 UTC) #48

jungshik at Google

Adding Peter to the reviewers list now. Peter, please take a look once you're back. ...

3 years, 8 months ago (2017-04-20 19:58:21 UTC) #49

Ryan Sleevi

rsleevi@chromium.org changed reviewers: + nick@chromium.org, thakis@chromium.org

3 years, 8 months ago (2017-04-20 20:04:48 UTC) #50

Ryan Sleevi

Going to add two other reviewers, since I think their feedback is important enough to ...

3 years, 8 months ago (2017-04-20 20:04:50 UTC) #51

Ryan Sleevi

And for context: If this goal is solely for .com, we could always download the ...

3 years, 8 months ago (2017-04-20 20:07:12 UTC) #52

Nico

On 2017/04/20 20:04:50, Ryan Sleevi wrote: > Going to add two other reviewers, since I ...

3 years, 8 months ago (2017-04-20 20:12:00 UTC) #53

jungshik at Google

Thanks, Ryan ! I didn't realize that components/ can depend on net/ (in retrospect, it's ...

3 years, 8 months ago (2017-04-20 20:12:34 UTC) #54

jungshik at Google

On 2017/04/20 20:12:00, Nico wrote: > On 2017/04/20 20:04:50, Ryan Sleevi wrote: > > Going ...

3 years, 8 months ago (2017-04-20 20:21:21 UTC) #55

jungshik at Google

On 2017/04/20 20:07:12, Ryan Sleevi wrote: > And for context: If this goal is solely ...

3 years, 8 months ago (2017-04-20 20:37:29 UTC) #56

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-20 21:07:10 UTC) #57

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-20 21:07:12 UTC) #58

ncarter (slow)

On 2017/04/20 20:37:29, jungshik at Google wrote: > On 2017/04/20 20:07:12, Ryan Sleevi wrote: > ...

3 years, 8 months ago (2017-04-20 22:12:13 UTC) #59

ncarter (slow)

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/README File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/README#newcode5 components/url_formatter/top_domains/README:5: src/tools/perf/page_sets/alexa1-10000-urls.json by running the following: IIRC the alexa10000 from ...

3 years, 8 months ago (2017-04-20 22:26:59 UTC) #60

ncarter (slow)

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/url_formatter.cc File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/url_formatter.cc#newcode557 components/url_formatter/url_formatter.cc:557: if (GetSkeleton(hostname, &skeleton) && LookupMatchInTopDomains(skeleton, 0)) If it's possible ...

3 years, 8 months ago (2017-04-20 23:37:22 UTC) #61

jungshik at Google

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf#newcode83 components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:83: microsoft.com, 1 On 2017/04/20 22:26:59, ncarter wrote: > Here ...

3 years, 8 months ago (2017-04-21 19:31:47 UTC) #62

jungshik at Google

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/url_formatter.cc File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/url_formatter.cc#newcode557 components/url_formatter/url_formatter.cc:557: if (GetSkeleton(hostname, &skeleton) && LookupMatchInTopDomains(skeleton, 0)) On 2017/04/20 23:37:22, ...

3 years, 8 months ago (2017-04-21 20:16:33 UTC) #63

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
components/url_formatter/url_formatter.cc:557: if (GetSkeleton(hostname,
&skeleton) && LookupMatchInTopDomains(skeleton, 0))
On 2017/04/20 23:37:22, ncarter wrote:
> If it's possible to efficiently obtain the skeleton (or do the diacritical
> removal) one input character at a time, then it might be worthwhile to use
> FixedSetIncrementalLookup here: we'd probably be done after one or two
> characters most of the time, since the first multibyte UTF-8 character would
> exhaust the DAFSA search.

Yeah, I saw that API and thought about trying one character at a time, but
didn't try. 

AccentRemoval is rather expensive. Before switching over to DAFSA, I used
sortkey with primary strength - meaning accents are ignored when generating
sortkey. It's faster to calculate sortkey, but sortkey uses almost all the bytes
(0x00 ~ 0xFF), so DAFSA cannot be used for sortkey look-up. OTOH, not being
able to use DAFSA means that I have to initialize a hash table of sortkeys at
runtime.

Moreover, I found dozens of domains registered in '.com' with 'base Latin
letter + combining mark' (there's no precomposed form for those sequences, so
normalization wouldn't get rid of them). One-character-at-a-time
accent-removal would not work for them.

As for the skeleton calculation, it should be (a lot) cheaper than
accent-removal. Again, the 'combining marks' above are problematic for
one-character-at-a-time calculation.

Perhaps I can branch depending on whether there's a combining mark in the input.
If there is, use the current approach. Otherwise, take a fast path for skeleton
lookup.
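
The branching idea above can be sketched in Python. This is only an illustration under stated assumptions: it treats any character in a Unicode "M" (mark) category that survives NFC normalization as a combining mark, and `has_combining_mark` and `choose_path` are hypothetical names rather than functions in the patch.

```python
import unicodedata


def has_combining_mark(hostname: str) -> bool:
    """True if a combining mark is still present after NFC normalization."""
    composed = unicodedata.normalize("NFC", hostname)
    return any(unicodedata.category(ch).startswith("M") for ch in composed)


def choose_path(hostname: str) -> str:
    """Pick the whole-string path only when per-character handling is unsafe."""
    return "slow" if has_combining_mark(hostname) else "fast"
```

For example, "e" followed by U+0301 composes to a single precomposed letter and takes the fast path, while "q" followed by U+0301 has no precomposed form and takes the slow path.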

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
components/url_formatter/url_formatter.cc:562:
LookupMatchInTopDomains(accent_free_name, 1);
On 2017/04/20 23:37:22, ncarter wrote:
> Should we (could we) do GetSkeleton() on the RemoveDiacritics string? (Is
> http://xn--ricrosoft-l6a.com a concern)?

That's a good one !! Thank you for the suggestion. I haven't thought about it.
(and ..... ick ...) 

In terms of performance, though, we might lose more ground by doing
accent-removal first (accent-removal is slow) than we gain from a faster DAFSA
lookup.

> If we did, then we might be able to reduce DAFSA size (only needing to store
> the skeletons) -- a smaller DAFSA is a faster DAFSA.

Related to that is how much saving we'd get if we could have a 'value-free' DAFSA
(if we just store skeletons, the only thing I want is membership. I don't care
about values).

> Also, as above, it's worth asking if FixedSetIncrementalLookup could provide a
> speedup here.

Yeah..

BTW, before trying to optimize further, I'd rather have a sense of how critical
a speed-up is for URL formatting in omnibox and elsewhere.

Ryan Sleevi

On 2017/04/21 20:16:33, jungshik at Google wrote: > > Also, as above, it's worth asking ...

3 years, 8 months ago (2017-04-21 20:24:16 UTC) #64

ncarter (slow)

https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf#newcode10 components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:10: // 1: original domain name. Should this file be ...

3 years, 8 months ago (2017-04-21 21:35:34 UTC) #65

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-21 23:07:47 UTC) #66

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/530001

3 years, 8 months ago (2017-04-21 23:08:35 UTC) #67

jungshik at Google

On 2017/04/21 21:35:34, ncarter wrote: > https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf > File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf > (right): > > https://codereview.chromium.org/2784933002/diff/490001/components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf#newcode10 ...

3 years, 8 months ago (2017-04-21 23:16:13 UTC) #68

On 2017/04/21 21:35:34, ncarter wrote:
> https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
> File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf
> (right):
>
> https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
> components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:10:
> // 1: original domain name.
> Should this file be generated at build time? I'm wondering what happens when
> we update ICU -- the ICU docs caution against caching skeletons.
> 
> For that matter, does chrome always use a bundled copy of ICU, or does it ever
> use a system copy?

Google Chrome always uses a bundled ICU, but Chromium can use a system ICU. And
you're right that the skeleton can change with the ICU version, even though the
skeletons of ASCII letters are not likely to change (all the Alexa top 10k are
ASCII-only except for one, which I filtered out).

Nonetheless, generating the gperf at build time seems to be a good idea. I'll
see how to tweak BUILD.gn for that.

 
> https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt...
> components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:867:
> google.no, 1
> Cases where a string is listed twice, like google.no, probably don't actually
> work properly. LookupStringInFixedSet can't return both 0 and 1 as the lookup
> result, even if the DAFSA encodes multiple results.
> 
> Instead, you should probably treat the two cases as a bitfield: 1 means
> skeleton, 2 means entry, and 3 means both, like google.no. This is how the
> effective TLD dafsa does it.

Thanks a lot for catching it. I'll fix that. 

> 
> We should probably have make_dafsa.py's parse_gperf function complain about
> duplicate keys. It may actually be emitting a dafsa that encodes both result
> values, but our decoder isn't set up to look at them.
> 
> [ALSO: sleevi -- this would have been caught by the "decoded dafsa should be
> equal to the input gperf" unittest that I wanted to add!]
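
The bitfield suggestion quoted above can be sketched in Python. This is a simplified illustration, not make_dafsa.py itself: `merge_flags`, `parse_gperf_strict`, and `toy_skeleton` are hypothetical names, the input is assumed to be bare "key, value" lines without gperf headers, and the toy skeleton only maps "m" to "rn" instead of calling ICU.

```python
SKELETON = 1  # key is the skeleton of a top domain
ENTRY = 2     # key is a top domain name itself


def merge_flags(names, skeleton_of):
    """Build one flag value per key, OR-ing ENTRY and SKELETON together."""
    flags = {}
    for name in names:
        flags[name] = flags.get(name, 0) | ENTRY
        skeleton = skeleton_of(name)
        flags[skeleton] = flags.get(skeleton, 0) | SKELETON
    return flags


def parse_gperf_strict(lines):
    """Parse 'key, value' lines and reject a key that appears twice."""
    entries = {}
    for line in lines:
        key, value = (part.strip() for part in line.split(","))
        if key in entries:
            raise ValueError(f"duplicate key: {key}")
        entries[key] = int(value)
    return entries


def toy_skeleton(name):
    # Hypothetical stand-in for the ICU skeleton: only maps "m" to "rn".
    return name.replace("m", "rn")
```

With this layout a name that is its own skeleton, such as google.no under the toy mapping, gets the single value 3 instead of two conflicting rows.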

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-22 00:32:35 UTC) #69

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-22 00:32:37 UTC) #70

Peter Kasting

There are a lot of reviewers on this, most of whom seem more capable than ...

3 years, 8 months ago (2017-04-24 22:00:57 UTC) #71

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-24 22:02:17 UTC) #72

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Two checks ...

3 years, 8 months ago (2017-04-24 22:03:12 UTC) #73

Description was changed from

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~88kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

To use the DAFSA-related code and tools from components/, they are moved from
net/ to base/.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~83kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~130kB.)

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per the test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

To use the DAFSA-related code and tools from components/, they are moved from
net/ to base/.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/570001

3 years, 8 months ago (2017-04-24 22:03:41 UTC) #74

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-24 23:30:31 UTC) #75

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-24 23:30:32 UTC) #76

jungshik at Google

jshin@chromium.org changed reviewers: - thakis@chromium.org

3 years, 8 months ago (2017-04-25 05:28:09 UTC) #77

jungshik at Google

Peter, I need your input on performance. On average (as tested with ~ 1 million ...

3 years, 8 months ago (2017-04-25 05:28:11 UTC) #78

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 7 months ago (2017-04-25 20:20:12 UTC) #79

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/610001

3 years, 7 months ago (2017-04-25 20:21:14 UTC) #80

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 7 months ago (2017-04-25 21:57:10 UTC) #81

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 7 months ago (2017-04-25 21:57:12 UTC) #82

Peter Kasting

On 2017/04/25 05:28:11, jungshik at Google wrote: > Peter, I need your input on performance. ...

3 years, 7 months ago (2017-04-25 22:04:02 UTC) #83

jungshik at Google

On 2017/04/25 22:04:02, Peter Kasting wrote: > On 2017/04/25 05:28:11, jungshik at Google wrote: > ...

3 years, 7 months ago (2017-04-26 00:30:50 UTC) #84

ncarter (slow)

On 2017/04/26 00:30:50, jungshik at Google wrote: > On 2017/04/25 22:04:02, Peter Kasting wrote: > ...

3 years, 7 months ago (2017-04-26 17:28:23 UTC) #85

On 2017/04/26 00:30:50, jungshik at Google wrote:
> On 2017/04/25 22:04:02, Peter Kasting wrote:
> > On 2017/04/25 05:28:11, jungshik at Google wrote:
> > > Peter, I need your input on performance. On average (as tested with ~ 1
> > > million IDNs with .com), this check increases the time to determine whether
> > > or not to display in Unicode by 7~80%.  I believe URL formatting is not in a
> > > perf-critical path, but I'm not sure of its interaction with what's done in
> > > omnibox.
> >
> > I _think_ this isn't in a perf-critical path.  It's in some codepaths that
> > look at first like they're perf-critical for the omnibox, but on further
> > digging I think we're likely OK.  However, if you land this, I would try and
> > coordinate with mpearson on which histograms to watch to catch perf
> > regressions in the field.
> 
> Thanks a lot, Peter !
> 
> I collected some data to see how the number of IDNs flagged (out of 1 million
> com IDNs), average runtime (for 1 million
> IDNs) and the size of DAFSA change with # of top domains included in the
> checklist. They're available at
> 
> https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR...
> 
> The check time is almost flat (does not change with # of top domains in the
> DAFSA). Apparently, it's dominated by accent-removal (and
> skeleton calculation; I suspect accent-removal takes up most of time). 
> 
> Given what Peter wrote (not a perf-critical path) and the above finding, the
> only thing remaining to consider is the memory impact.
> 
> If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's
> ~27kB. Given the slope of the memory vs # of flagged domains,
> I'm tempted to cut off at 3,000 (27kB memory / binary size impact). 
> 
> I wonder what others think of that cut-off.

Because 'reject' is the common result of the DAFSA lookup, it's likely only
going to explore the first few nodes. In fact, if the first character of the
query string is a multibyte UTF-8 character, we'll never even dereference a
single byte of the DAFSA at all. I expect that to be true for most IDNs -- maybe
less so for Cyrillic since it has a lot of ASCII-confusable characters.

If that's right, I would expect the DAFSA part of the performance to be near
constant as the dafsa grows in size. As you add bigrams, the consecutive offsets
of the first node will get more spaced out in memory so scanning the node will
eventually hit more and more cache lines. Our current DAFSA layout is optimized
for space, at some cost to scanning efficiency.

For the skeleton analysis, I wonder: would we gain anything by excluding domains
that provably have no single-script skeleton equivalent? I'm not saying I know
how to compute this, but I believe it is computable. For example,
http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008
suggests that lowercase k doesn't have any confusables.
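
The early-reject behaviour described at the start of this message can be illustrated with a plain byte-level trie. This is a sketch only: a trie is a simpler stand-in for the DAFSA, it assumes all stored keys are ASCII, and `build_trie` and `lookup` are hypothetical names, not the FixedSetIncrementalLookup API.

```python
END = None  # marker key for "a stored string ends at this node"


def build_trie(keys):
    """Build a nested-dict trie over the UTF-8 bytes of each key."""
    root = {}
    for key in keys:
        node = root
        for byte in key.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[END] = True
    return root


def lookup(trie, query):
    """Return (found, steps), where steps counts the bytes that matched."""
    node = trie
    steps = 0
    for byte in query.encode("utf-8"):
        if byte not in node:
            return False, steps
        node = node[byte]
        steps += 1
    return END in node, steps
```

With only ASCII keys stored, a query whose first character is multibyte fails on its first byte, so the step count stays at zero regardless of how many keys the trie holds.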

ncarter (slow)

the dafsa usage seems correct, so lgtm from that perspective https://codereview.chromium.org/2784933002/diff/610001/components/url_formatter/url_formatter.cc File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/610001/components/url_formatter/url_formatter.cc#newcode214 ...

3 years, 7 months ago (2017-04-26 18:46:29 UTC) #86

jungshik at Google

3 years, 7 months ago (2017-04-26 19:12:38 UTC) #87

Thank you for the explanation and suggestion. 

> On 2017/04/26 00:30:50, jungshik at Google wrote:

>
https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR...
> > 
> > The check time is almost flat (does not change with # of top domains in the
> > DAFSA). Apparently, it's dominated by accent-removal (and
> > skeleton calculation; I suspect accent-removal takes up most of time). 
> > 
> > Given what Peter wrote (not a perf-critical path) and the above finding, the
> > only thing remaining to consider is the memory impact.
> > 
> > If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's
> > ~27kB. Given the slope of the memory vs # of flagged domains,
> > I'm tempted to cut off at 3,000 (27kB memory / binary size impact). 
> > 
> > I wonder what others think of that cut-off.
> 
> Because 'reject' is the common result of the DAFSA lookup, it's likely only
> going to explore the first few nodes. In fact, if the first character of the
> query string is a multibyte UTF-8 character, we'll never even dereference a
> single byte of the DAFSA at all. I expect that to be true for most IDNs --
> maybe less so for cyrillic since it has a lot of ascii-confusable characters.
>
> If that's right, I would expect the DAFSA part of the performance to be near
> constant as the dafsa grows in size. As you add bigrams, the consecutive
> offsets of the first node will get more spaced out in memory so scanning the
> node will eventually hit more and more cache lines. Our current DAFSA layout
> is optimized for space, at some cost to scanning efficiency.
>
> For the skeleton analysis, I wonder: would we gain anything by excluding
> domains that provably have no single-script skeleton equivalent? I'm not
> saying I know how to compute this, but I believe it is computable. For
> example,
> http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008
> suggests that lowercase k doesn't have any confusables.

You want to try it for the space efficiency, don't you? (As we discussed,
there's little worry about speed because 1) DAFSA scanning contributes very
little to perf 2) the code in question is not in a perf-critical path.)

To compute that, I need to scrape the Unicode confusables data [1]
(there's no ICU API to get the list of confusables for a given character) and
traverse over all possible confusables combinations for a given label to see if
any of them is a single script (i.e. if any of them would not be flagged as
mixed-script). That will be interesting, but I don't expect a drastic reduction.


[1] http://www.unicode.org/Public/security/9.0.0/confusables.txt : has 736
entries for LDH in ASCII, but a lot of them are outside the allowed character
set.

jungshik at Google

Thank you for LGTM. Addressed the comments in the latest PS. https://codereview.chromium.org/2784933002/diff/610001/components/url_formatter/url_formatter.cc File components/url_formatter/url_formatter.cc (right): ...

3 years, 7 months ago (2017-04-26 19:36:21 UTC) #88

ncarter (slow)

3 years, 7 months ago (2017-04-26 20:20:40 UTC) #89

On 2017/04/26 19:12:38, jungshik at Google wrote:
> Thank you for the explanation and suggestion. 
> 
> > On 2017/04/26 00:30:50, jungshik at Google wrote:
> 
> >
>
https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR...
> > > 
> > > The check time is almost flat (does not change with # of top domains in
> > > the DAFSA). Apparently, it's dominated by accent-removal (and
> > > skeleton calculation; I suspect accent-removal takes up most of time).
> > >
> > > Given what Peter wrote (not a perf-critical path) and the above finding,
> > > the only thing remaining to consider is the memory impact.
> > >
> > > If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000,
> > > it's ~27kB. Given the slope of the memory vs # of flagged domains,
> > > I'm tempted to cut off at 3,000 (27kB memory / binary size impact).
> > >
> > > I wonder what others think of that cut-off.
> > 
> > Because 'reject' is the common result of the DAFSA lookup, it's likely only
> > going to explore the first few nodes. In fact, if the first character of the
> > query string is a multibyte UTF-8 character, we'll never even dereference a
> > single byte of the DAFSA at all. I expect that to be true for most IDNs --
> > maybe less so for cyrillic since it has a lot of ascii-confusable
> > characters.
> >
> > If that's right, I would expect the DAFSA part of the performance to be near
> > constant as the dafsa grows in size. As you add bigrams, the consecutive
> > offsets of the first node will get more spaced out in memory so scanning the
> > node will eventually hit more and more cache lines. Our current DAFSA layout
> > is optimized for space, at some cost to scanning efficiency.
> >
> > For the skeleton analysis, I wonder: would we gain anything by excluding
> > domains that provably have no single-script skeleton equivalent? I'm not
> > saying I know how to compute this, but I believe it is computable. For
> > example,
> > http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008
> > suggests that lowercase k doesn't have any confusables.
> 
> You want to try it for the space efficiency, don't you? (As we discussed,
> there's little worry about speed because 1) DAFSA scanning contributes very
> little to perf 2) the code in question is not in a perf-critical path.)
>
> To compute that, I need to scrape the Unicode confusables data [1]
> (there's no ICU API to get the list of confusables for a given character) and
> traverse over all possible confusables combinations for a given label to see
> if any of them is a single script (i.e. if any of them would not be flagged as
> mixed-script). That will be interesting, but I don't expect a drastic
> reduction.
>
>
> [1] http://www.unicode.org/Public/security/9.0.0/confusables.txt : has 736
> entries for LDH in ASCII, but a lot of them are outside the allowed character
> set.

Right, I was only concerned about space (or really, about coverage: with a
given space budget, if we can exclude some useless entries, that makes room
for other vulnerable sites to take their place). But if pretty much
every entry is potentially spoofable via diacriticals, that's not too much
DAFSA size to shed, and the idea probably isn't worth exploring.

FWIW, I agree, I see no simple ICU way to explore the mapping of skeleton
to single-script confusable string either. And if we did build such a tool,
it would have some potentially nefarious uses.

jungshik at Google

3 years, 7 months ago (2017-05-02 23:36:53 UTC) #90

Description was changed from

==========
Add checks against spoofing attempt at top domains

Two checks are added against potential spoofing attempts.

1. Calculate the confusability skeletons of a hostname and look it up in the
   pre-calculated list of the skeletons of top 10k domains.

2. Remove diacritic marks from a hostname and compare it against the list of
   top 10k domains. This is equivalent to comparing names with the primary
   collation strength in the root locale. To make them equivalent, three
   mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~ 83kB for the combined DAFSA representation of top
domain names and their skeletons. (Two separate DAFSAs for names and skeletons
take up ~ 130kB.)

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4534)

The DAFSA-related code and tools are moved from net/ to base/ so that they can
be used from components/.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4595)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

jungshik at Google

3 years, 7 months ago (2017-05-02 23:45:12 UTC) #91

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4595)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings (ӏ -> l; к -> k) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4595)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

jungshik at Google

3 years, 7 months ago (2017-05-03 00:25:18 UTC) #92

On 2017/04/26 20:20:40, ncarter wrote:
> On 2017/04/26 19:12:38, jungshik at Google wrote:
> > Thank you for the explanation and suggestion. 

> > You want to try it for the space efficiency, don't you? (As we discussed,
> > there's little worry about speed because 1) DAFSA scanning contributes very
> > little to perf 2) the code in question is not
> > in a perf-critical path.) 

> Right, I was only concerned about space (or really, about coverage: with a
> given space budget, if we can exclude some useless entries, that makes room
> for other vulnerable sites to take their place). But if pretty much
> every entry is potentially spoofable via diacriticals, that's not too much
> DAFSA size to shed, and the idea probably isn't worth exploring.
> 
> FWIW, I agree, I see no simple ICU way to explore the mapping of skeleton
> to single-script confusable string either. And if we did build such a tool,
> it would have some potentially nefarious uses.

Thank you for the clarification. I agree that it would not be wise to publish
that, but somebody apparently did on GitHub.

Before moving forward, I wanted to make sure that using the Alexa list is OK
and that it's taken care of.

In the meantime (while waiting for the answer to the above), bug 714628 came up
and made me revise this CL. Basically, I followed your earlier proposal to
remove diacritics before calculating the skeleton. After that, the look-up is
done only in the skeletons of top domains (instead of two separate look-ups).
That cuts down the data size by 23kB for top 10k domains (83kB -> ~60kB).
The performance is about 10% better (I guess not mainly because of savings in
look-up but because of some other accompanying changes).
An additional benefit is that the code got simpler.

https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR...
: PS37 tab has the information. 

Anyway, can you take another look?  

Ryan, can you also take a look?   

We also have to decide how many top domains to include in the list. As shown in
the spreadsheet, adding more domains (increasing the DAFSA size) has
diminishing returns.
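
For readers following along, the revised flow (remove diacritics, then compute
the skeleton, then do a single look-up) can be sketched in Python. This is only
an approximation under stated assumptions: it uses the standard unicodedata
module rather than ICU, toy_skeleton is a hypothetical stand-in that applies
only the supplemental mappings from the CL description (a real skeleton comes
from ICU's confusables data), and the two-entry domain set is made up.

```python
import unicodedata

# The three extra mappings from the CL description. These letters have no
# canonical decomposition, so NFD alone does not reduce them to l, o, d.
EXTRA_MAPPINGS = str.maketrans({"\u0142": "l", "\u00f8": "o", "\u0111": "d"})

# Supplemental confusables from the CL description: Cyrillic ka, kra and
# Greek kappa map to k; Cyrillic pe maps to n.
SUPPLEMENTAL = str.maketrans(
    {"\u043a": "k", "\u0138": "k", "\u03ba": "k", "\u043f": "n"})


def remove_diacritics(label):
    """Decompose, drop non-spacing marks, then apply the extra mappings."""
    decomposed = unicodedata.normalize("NFD", label)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.translate(EXTRA_MAPPINGS)


def toy_skeleton(label):
    """Hypothetical stand-in for the ICU skeleton; supplemental maps only."""
    return label.translate(SUPPLEMENTAL)


# Made-up stand-in for the pre-calculated skeletons of top domains.
TOP_SKELETONS = {toy_skeleton(d) for d in ("google.com", "facebook.com")}


def matches_top_domain(hostname):
    """Single look-up of the accent-free skeleton, as in the revised CL."""
    return toy_skeleton(remove_diacritics(hostname)) in TOP_SKELETONS


print(remove_diacritics("g\u00f3\u00f3gle.com"))
print(matches_top_domain("faceboo\u043a.com"))
```

Because diacritics are stripped before the skeleton is taken, one set of
skeletons covers both checks, which is consistent with the size reduction
reported above.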

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 7 months ago (2017-05-03 19:46:33 UTC) #93

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/670001

3 years, 7 months ago (2017-05-03 19:46:58 UTC) #94

jungshik at Google

3 years, 7 months ago (2017-05-03 19:49:32 UTC) #95

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings (ӏ -> l; к -> k) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4595)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 7 months ago (2017-05-03 20:29:43 UTC) #96

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_tsan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_tsan_rel_ng/builds/66301)

3 years, 7 months ago (2017-05-03 20:29:45 UTC) #97

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 7 months ago (2017-05-03 22:43:02 UTC) #98

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/710001

3 years, 7 months ago (2017-05-03 22:43:46 UTC) #99

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 7 months ago (2017-05-03 23:47:56 UTC) #100

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 7 months ago (2017-05-03 23:47:58 UTC) #101

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 7 months ago (2017-05-05 23:02:47 UTC) #102

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/730001

3 years, 7 months ago (2017-05-05 23:05:13 UTC) #103

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 7 months ago (2017-05-06 00:58:48 UTC) #104

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 7 months ago (2017-05-06 00:58:50 UTC) #105

jungshik at Google

3 years, 7 months ago (2017-05-08 21:12:33 UTC) #106

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~ 2µs longer on average (3.3 µs => 5.5µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

jungshik at Google

Peter, I also need your owner review/approval as well. Thank you, This is not perfect ...

3 years, 7 months ago (2017-05-08 23:36:01 UTC) #107

jungshik at Google

Oops. The beginning of my reply is lost in copy'n'paste. Ryan, can you take a ...

3 years, 7 months ago (2017-05-08 23:37:30 UTC) #108

Ryan Sleevi

rsleevi@chromium.org changed reviewers: + brettw@chromium.org, emilyschechter@chromium.org

3 years, 7 months ago (2017-05-09 00:00:25 UTC) #109

Ryan Sleevi

3 years, 7 months ago (2017-05-09 00:00:29 UTC) #110

I'm not sure I feel qualified enough to review this from a policy side, and I
think Peter's probably coverage enough from a technical side.

I've added Brett for the GN changes (and as he's a backup OWNER), but my
understanding is that your goal is to have this checked in as a single binary
(like tld_cleanup) and run manually whenever the tests' Alexa Top 10K list
changes.

I added Emily to make sure it's got the visibility - from all the threads, I
couldn't quite figure out who 'owned' the decision, but if anyone knows, it'll
be Emily. I chatted with palmer@ to make sure it wasn't him on a technical
front, and he confirmed.

So:
- Peter for impl
- Brett for GN best practices
- Emily for go/no-go

I think that covers it? :)

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/BUILD.gn (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/BUILD.gn:21:
executable("make_top_domain_gperf") {
I would defer to brettw on whether any toolchain tricks are needed here to
make sure it generates something that will run on the host architecture during
cross-compilation to generate the target dataset...

Or is this just intended to be run manually?

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/make_alexa_top_list.py (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets",
"alexa1-10000-urls.json")
It seems like this should be an input parameter to the script, so that the
dependency can be captured by BUILD.gn

Did I miss a reply to the concerns previously raised that this list is very out
of date?

You may wish to check with OSPO on this, since it'll be shipped in the Chrome
binary, which is different than how it's being used today.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some
popular domains if they're missing.
I'm not sure why this part is needed - it seems rather subjective?

Peter Kasting

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/README File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/README#newcode2 components/url_formatter/top_domains/README:2: It is an input to make_top_domain_list and is made ...

3 years, 7 months ago (2017-05-09 01:37:04 UTC) #111

jungshik at Google

3 years, 7 months ago (2017-05-09 19:57:39 UTC) #112

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/BUILD.gn (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/BUILD.gn:21:
executable("make_top_domain_gperf") {
On 2017/05/09 00:00:29, Ryan Sleevi wrote:
> I would defer to brettw here whether there's any toolchain tricks needed here
> to make sure it generates something that will run on the host architecture
> during cross-compilation to generate the target dataset...
> 
> Or is this just intended to be run manually?

Yes, it's meant to be run manually. That's why is_ios and is_android are excluded.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/make_alexa_top_list.py (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets",
"alexa1-10000-urls.json")
On 2017/05/09 00:00:29, Ryan Sleevi wrote:
> It seems like this should be an input parameter to the script, so that the
> dependency can be captured by BUILD.gn
> 
> Did I miss a reply to the concerns previously raised that this list is very
> out of date?
> 
> You may wish to check with OSPO on this, since it'll be shipped in the Chrome
> binary, which is different than how it's being used today.

I checked with Chrome counsel (which is why I was 'silent' for a while on this
CL).   Using the list (outdated/old) already in the chromium tree was
green-lighted.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some
popular domains if they're missing.
On 2017/05/09 00:00:29, Ryan Sleevi wrote:
> I'm not sure why this part - it seems to be more subjective?

Yes, it's subjective. Because the list is old, some newer domains are not in the
list. And, if the cut-off is < 6000, gmail and hotmail wouldn't make the list,
which is why I'm adding them manually. 360, ntd, and onckds are from the
publicly available Alexa top 50 list.
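
The combination of a rank cut-off and manual additions described above might
look roughly like the following Python sketch. It is not the code of
make_alexa_top_list.py: the function name build_top_list, the in-memory list of
URLs, and the example.* hostnames are all assumptions made for illustration,
and reading the JSON input is left out because its layout is not shown here.

```python
from urllib.parse import urlsplit


def build_top_list(urls, cutoff, extra_domains=()):
    """Take hosts from the first `cutoff` URLs, then append missing extras."""
    seen = set()
    result = []
    for url in urls[:cutoff]:
        host = urlsplit(url).hostname
        if host and host not in seen:
            seen.add(host)
            result.append(host)
    # Manually added domains are appended only if the ranked part of the
    # list did not already contain them.
    for domain in extra_domains:
        if domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result


# Hypothetical ranked input, ordered by popularity.
ranked = ["http://example.com/", "http://example.org/", "http://example.net/"]
print(build_top_list(ranked, 2, ["gmail.com", "example.com"]))
```

Appending the extras after the cut-off keeps the subjective part of the list in
one visible place, so it can be dropped once a fresher ranked list is available.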

jungshik at Google

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/make_alexa_top_list.py File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/make_alexa_top_list.py#newcode21 components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 00:00:29, Ryan Sleevi wrote: > ...

3 years, 7 months ago (2017-05-09 20:19:42 UTC) #113

Peter Kasting

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/make_alexa_top_list.py File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/top_domains/make_alexa_top_list.py#newcode21 components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 19:57:39, jungshik at Google wrote: ...

3 years, 7 months ago (2017-05-09 20:50:14 UTC) #114

jungshik at Google

3 years, 7 months ago (2017-05-10 18:05:16 UTC) #115

Addressed most of Peter's comments in the latest PS

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/README (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/README:2: It is an input to
make_top_domain_list and is made up of list of Alexa
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: "The Alexa top 10k domains, one per line, constructed by running
> make_alexa_top_list.py.  Used as an input to make_top_domain_gperf."
> 
> I assume that "list" here and below should have been "gperf".
> 
> Also, be consistent about whether or not to use a newline between the bulleted
> name above and the explanation here.

Thanks. Done

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/README:9: It is generated by running
make_top_domain_list and checked in.
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: "The checked-in output of make_top_domain_gperf.  Processed during the
> build to generate alexa_names_and_skeletons-inc.cc, which is used by
> url_formatter.cc.  This must be regenerated as follows if ICU is updated,
> since skeletons can differ across ICU versions: <instructions>"

Thank you for a much better rewrite. 

> This makes me wonder if there should be a README note in the ICU directory
> that says to regenerate this.

I'll add it to README.chromium in third_party/icu.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/make_alexa_top_list.py (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:3: # # Use of this
source code is governed by a BSD-style license that can be
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: No need for # #? (2 places)

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:6: """Generate
alexa_domains.list from
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: Generates?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets",
"alexa1-10000-urls.json")
On 2017/05/09 20:50:14, Peter Kasting wrote:
> On 2017/05/09 19:57:39, jungshik at Google wrote:
> > On 2017/05/09 00:00:29, Ryan Sleevi wrote:
> > > It seems like this should be an input parameter to the script, so that the
> > > dependency can be captured by BUILD.gn
> > > 
> > > Did I miss a reply to the concerns previously raised that this list is
> > > very out of date?
> > >
> > > You may wish to check with OSPO on this, since it'll be shipped in the
> > > Chrome binary, which is different than how it's being used today.
> >
> > I checked with Chrome counsel (which is why I was 'silent' for a while on
> > this CL). Using the list (outdated/old) already in the chromium tree was
> > green-lighted.
>
> Separately, are we going to be updating that list?  It would be nice to have a
> more up-to-date list which obviates the need for the manual additions below.
> If this will be in the tree long-term, we also need a process to ensure it's
> regularly maintained, including a person or team who will be on the hook to
> maintain it.

That's a weak link because an up-to-date Alexa list is not public (only the top
50 list is public). I wish there were an alternative public list.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/top_domains/make_top_domain_gperf.cc (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:44: std::cerr <<
"failed to write to " << path.AsUTF8Unsafe()
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: Initial caps?  (several places)

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:48: return true;
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: Shorter:
> 
>   bool succeeded =
>       base::WriteFile(path, content.data(), content.size()) != -1;
>   if (!succeeded)
>     std::cerr << "failed to write to " << path.AsUTF8Unsafe() << '\n';
>   return succeeded;
> 
> ...though I wonder if "!= -1" should be a check that you actually wrote the
> full size expected.

Ok. Turned it into CHECK.
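
For illustration only, here is a minimal standalone sketch of the "did we write
the full size" check raised above. It uses the C standard library rather than
base::WriteFile, and WriteAll is a hypothetical helper name, not code from this
CL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of the CL): returns true only if every byte
// of |content| was written to |path|.
bool WriteAll(const std::string& path, const std::string& content) {
  assert(!path.empty());
  std::FILE* file = std::fopen(path.c_str(), "wb");
  if (!file)
    return false;
  // fwrite() returns the number of items written, which may be short.
  const std::size_t written =
      std::fwrite(content.data(), 1, content.size(), file);
  // fclose() flushes buffered data, so its result matters as well.
  const bool closed = std::fclose(file) == 0;
  return written == content.size() && closed;
}
```

Comparing the byte count against content.size() catches short writes that a
bare "!= -1" test would let through.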

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:53: std::cerr <<
"Generate the list of top domain skeletons to use as\n"
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Generates
> 
> Seems like this could be wrapped closer to 80 columns of output too.
> Something like this (which also takes into account the next two comments):
> 
>     std::cerr << "Generates the list of top domain skeletons to use as input to"
>                  "\nbase/dafsa/make_dafsa.py.\nUsage: " << argv[0] << '\n';

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:54: << "input to
base/dafsa/make_dafsa.py\n";
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: Leading << not necessary when continuing previous string constant
> (several places)

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:55: std::cerr <<
"Usage: " << argv[0] << '\n';
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: Why not just continue << from the previous statement?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:79:
std::stringstream output;
On 2017/05/09 01:37:02, Peter Kasting wrote:
> As far as I can tell, you just append unformatted strings to this.  Any reason
> not to just use a string and += directly?

You're right. Thank you for catching it.
At first, it was an ostream to a file, which I later turned into a stringstream
without thinking much when rewriting to use base::Path, etc. Changed to
std::string and "+=".

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:89: <<
"confusability check.\n"
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: "...for the confusability check in <sourcefile name>" (or
> <function_name>)?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:94: std::string
domains_with_max_labels;
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: domain, singular?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/top_domains/make_top_domain_gperf.cc:111: 
On 2017/05/09 01:37:02, Peter Kasting wrote:
> Nit: No blank line

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
File components/url_formatter/url_formatter.cc (right):

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:203: class IDNSpoofChecker {
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: It might be nice to pull this class out to its own .h/.cc for maximum
> readability.

Ok. pulled it out.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:207: // Returns true if |label| is
safe to display as Unicode. When the TLD is
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Does the second sentence here really need to be here?  It seems like it
> only describes a portion of the functionality of the function.  Maybe we
> should just say "See the function body for details on the specific safety
> checks performed"?

Yeah, that's better.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:211: bool Check(base::StringPiece16
label, bool is_tld_ascii);
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: This is a poor function name; how about something like
> SafeToDisplayAsUnicode()?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:224: void
SetAllowedUnicodeSet(UErrorCode* status);
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: I suggest adding comments for these even though they're private.

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:270: bool has_idn_component = false;
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Can we reach this function with an input that doesn't cause the loop below to
> set this to true?  It seems unlikely.  If not, we could eliminate this
> variable.

IDNToUnicode (which calls this function) is called with any host name out of
GURL. So, converted_idn can be false for every label/component (e.g. an
all-ASCII host name), in which case has_idn_component stays false, too.
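
A small standalone sketch of that point, under loud assumptions:
IsPunycodeLabel is a hypothetical stand-in for the real per-label conversion
(it only checks for a lowercase "xn--" prefix), and HasIdnComponent is not a
function in this CL. It only shows that an all-ASCII host leaves the flag
false.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-label conversion result: true if |label|
// is a punycode ("xn--") label. Assumes the host is already lowercased.
bool IsPunycodeLabel(const std::string& label) {
  return label.compare(0, 4, "xn--") == 0;
}

// Accumulates the per-label result the same way the reviewed loop does.
bool HasIdnComponent(const std::vector<std::string>& labels) {
  assert(!labels.empty());
  bool has_idn_component = false;
  for (const std::string& label : labels) {
    const bool converted_idn = IsPunycodeLabel(label);
    has_idn_component |= converted_idn;
  }
  return has_idn_component;
}
```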

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:286: has_idn_component =
has_idn_component || converted_idn;
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Or use |=

Changed.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:300: if (has_idn_component &&
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Might want a comment above this block like "Leave as punycode any inputs
> that spoof top domains."

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:305: }
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Blank line after this?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:402: if (U_FAILURE(status))
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Do not handle DCHECK failure; assume DCHECKs cannot fail.  If they can, they
> should be conditionals, not DCHECKs.

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:436: //   - it's made entirely of
Cyrillic letters that look like Latin letters.
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: it's -> the TLD is ASCII, and the input is ?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:454: // Note that non-ASCII Latin
check should not be applied when the entire label
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: that -> that the ?

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:515: const size_t
kNumberOfLabelsToCheck = 3;
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Can we write this value into the file so we don't need to hardcode it here?

make_top_domain_gperf can write that out to another file (whose only line,
other than the license boilerplate, would be the one above). Would that work
for you?

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:517: bool
LookupStringInSet(base::StringPiece needle,
On 2017/05/09 01:37:04, Peter Kasting wrote:
> Nit: If you're not going to use boring names for your params, I'd copy the
> ones from the underlying net:: declaration rather than using |needle|.
> 
> That said, this wrapper is so short, and is called only once, that I'd just
> inline the body of this at the callsite below.

Inlined it.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:528: DCHECK(hostname[hostname.length()
- 1] != '.');
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: hostname.back()

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:533: labels.erase(labels.begin());
On 2017/05/09 01:37:04, Peter Kasting wrote:
> Nit: Seems like a single call to vector::erase could be more efficient than a
> while loop.

Done.
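
As a rough illustration of the single-erase suggestion (KeepLastLabels is a
hypothetical helper name, and the surrounding code in the CL may differ), the
leading labels can be dropped with one range erase:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: keeps only the last |max_labels| labels.
void KeepLastLabels(std::vector<std::string>* labels, std::size_t max_labels) {
  assert(labels);
  if (labels->size() > max_labels) {
    // One range erase shifts the surviving elements once, instead of once per
    // removed label as repeated erase(begin()) calls would.
    labels->erase(labels->begin(),
                  labels->begin() + (labels->size() - max_labels));
  }
}
```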

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:535: while (labels.size() > 1) {
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Is this naive loop faster than computing the actual eTLD+1 length using the
> RCDS and then doing a single DAFSA lookup?

'hostname' is not a good name (at one point it was either a hostname or its
skeleton, but now it's always a skeleton). Changed it to |skeleton|. An eTLD
match cannot be done on the skeleton because even 'com' is turned into
'c o r n' (without spaces).

We could try the eTLD match before the skeleton calculation. Hmm, it appears
that RCDS canonicalizes 'hostname' before finding eTLD+1. The canonicalization
would turn an IDN into punycode, which would not work here.

Because the number of labels is limited to 3 here, at most two lookups are
done. So, I'd expect little difference even if RCDS worked without
canonicalization.

OTOH, thanks to your suggestion, it occurred to me that I can change the 2-step
process (Python + C++) into one step (C++ using RCDS, which accepts URLs, to
extract eTLD+1).
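
To make the "at most two lookups" argument concrete, here is a minimal sketch
of the suffix loop. It is an assumption-laden illustration: a std::unordered_set
stands in for the DAFSA, and JoinFrom and MatchesTopDomain are hypothetical
names, not the functions in the CL.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the DAFSA: a plain set of domain skeletons.
using SkeletonSet = std::unordered_set<std::string>;

// Joins labels[first..] with dots.
std::string JoinFrom(const std::vector<std::string>& labels,
                     std::size_t first) {
  std::string joined;
  for (std::size_t i = first; i < labels.size(); ++i) {
    if (i > first)
      joined += '.';
    joined += labels[i];
  }
  return joined;
}

// Returns true if any suffix of |labels| with at least two labels is in
// |top_skeletons|. |lookups| receives the number of set lookups performed.
bool MatchesTopDomain(const std::vector<std::string>& labels,
                      const SkeletonSet& top_skeletons,
                      int* lookups) {
  assert(lookups);
  *lookups = 0;
  for (std::size_t first = 0; first + 1 < labels.size(); ++first) {
    ++*lookups;
    if (top_skeletons.count(JoinFrom(labels, first)) > 0)
      return true;
  }
  return false;
}
```

With three labels the loop tries the full three-label name and then the
two-label suffix, so it never performs more than two lookups.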

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:546: (*(hostname.rbegin()) == '.' ? 1
: 0);
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: Use .back() instead of *rbegin()

Done.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:552: ustr_host.length() &&
transliterator_)
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Note that if the DCHECK earlier is assumed not to fail, this null-check can
> disappear.

removed it.

https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt...
components/url_formatter/url_formatter.cc:557:
uspoof_getSkeletonUnicodeString(checker_, 0, /* not used. deprecated. */
On 2017/05/09 01:37:03, Peter Kasting wrote:
> Nit: If you're going to add /* */ (which I'm not sure is necessary), do so
> before the comma to make it very clear which parameter this is on.
> 
> "deprecated." is also probably unnecessary here.

ok. just removed it.

jungshik at Google

Addressed most of Peter's comments in the latest PS

3 years, 7 months ago (2017-05-10 18:05:21 UTC) #116

jungshik at Google

On 2017/05/10 18:05:21, jungshik at Google wrote: > Addressed most of Peter's comments in the ...

3 years, 7 months ago (2017-05-10 18:06:30 UTC) #117

ncarter (slow)

+nparker for the following question: This proposed defense against IDN homograph spoofing requires a list ...

3 years, 7 months ago (2017-05-10 18:41:34 UTC) #118

Peter Kasting

This is looking pretty good. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/url_formatter.cc File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatter/url_formatter.cc#newcode515 components/url_formatter/url_formatter.cc:515: const size_t kNumberOfLabelsToCheck = ...

3 years, 7 months ago (2017-05-10 22:38:47 UTC) #119

jungshik at Google

The CQ bit was checked by jshin@chromium.org to run a CQ dry run

3 years, 7 months ago (2017-05-14 09:24:50 UTC) #120

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/830001

3 years, 7 months ago (2017-05-14 09:24:55 UTC) #121

jungshik at Google

Thanks for a thorough review. (as you found out, pulling out IDNSpoofChecker was done separately ...

3 years, 7 months ago (2017-05-14 09:36:23 UTC) #122

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 7 months ago (2017-05-14 10:33:21 UTC) #123

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 7 months ago (2017-05-14 10:33:23 UTC) #124

ncarter (slow)

https://codereview.chromium.org/2784933002/diff/750001/components/url_formatter/url_formatter_unittest.cc File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatter/url_formatter_unittest.cc#newcode286 components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top ...

3 years, 7 months ago (2017-05-15 18:27:57 UTC) #125

Peter Kasting

LGTM, I trust you to follow up appropriately on outstanding issues https://codereview.chromium.org/2784933002/diff/750001/components/url_formatter/url_formatter_unittest.cc File components/url_formatter/url_formatter_unittest.cc (right): ...

3 years, 7 months ago (2017-05-15 18:55:31 UTC) #126

LGTM, I trust you to follow up appropriately on outstanding issues

https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt...
File components/url_formatter/url_formatter_unittest.cc (right):

https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt...
components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are
listed for unittest in the top domain list.
On 2017/05/15 18:27:57, ncarter wrote:
> On 2017/05/14 09:36:23, jungshik at Google wrote:
> > On 2017/05/10 22:38:47, Peter Kasting wrote:
> > > I still would like to see the test poke these into place rather than have
> > > them included in the compiled-in list.
> > 
> > I don't disagree with you, but couldn't think of a simple way to add them
> > only for test at run-time.
> > 
> > Perhaps, I can do something similar to what's done in
> > net/base/registry_controlled_domains/registry_controlled_domain.h and
> > related files.
> 
> The way the other unittests work, we do a separate DAFSA generation step on
> the test data, and there's a SetDafsaForTesting() function
> (SetFindDomainGraph) that lets you swap out the global DAFSA structure.

SGTM
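
A minimal sketch of the "swap out the global structure for tests" pattern
described above, under stated assumptions: a std::unordered_set stands in for
the DAFSA, and SetTopDomainsForTesting and IsTopDomainSkeleton are hypothetical
names, not the net:: or url_formatter API.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

using DomainSet = std::unordered_set<std::string>;

namespace {

// Stand-in for the compiled-in list.
const DomainSet& DefaultDomains() {
  static const DomainSet kDomains = {"google.corn"};
  return kDomains;
}

// Non-null only while a test has installed its own data.
const DomainSet* g_domains_for_testing = nullptr;

}  // namespace

// Test-only hook (hypothetical name). Pass nullptr to restore the default.
void SetTopDomainsForTesting(const DomainSet* domains) {
  assert(!domains || !g_domains_for_testing);  // No nested overrides.
  g_domains_for_testing = domains;
}

bool IsTopDomainSkeleton(const std::string& skeleton) {
  const DomainSet& domains =
      g_domains_for_testing ? *g_domains_for_testing : DefaultDomains();
  return domains.count(skeleton) > 0;
}
```

With a hook like this, test-only entries never need to be part of the shipped
list; a test installs its own data in setup and resets it in teardown.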

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
File components/url_formatter/idn_spoof_checker.h (right):

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
components/url_formatter/idn_spoof_checker.h:57: // Returns true if all the
Cyrillic letters in |label| belong to a set of
Nit: I would put a blank line above this comment and the next one, just to make
"comment + function" easier for the eye to parse visually.

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
components/url_formatter/idn_spoof_checker.h:61: // successfully and stored in
|skeleton|.
Nit: This comment sounds like the function just checks the skeleton rather than
computing it?  Maybe "Stores the confusability skeleton for |hostname| in
|skeleton|.  Returns whether the computation was successful."?

All that said, I don't actually see this method defined anywhere in the
codebase.  Maybe it doesn't really exist?

jungshik at Google

On 2017/05/15 18:27:57, ncarter (slow) wrote: > https://codereview.chromium.org/2784933002/diff/750001/components/url_formatter/url_formatter_unittest.cc > File components/url_formatter/url_formatter_unittest.cc (right): > > https://codereview.chromium.org/2784933002/diff/750001/components/url_formatter/url_formatter_unittest.cc#newcode286 ...

3 years, 7 months ago (2017-05-17 22:26:09 UTC) #127

jungshik at Google

Thank you, Peter and Nick. > LGTM, I trust you to follow up appropriately on ...

3 years, 7 months ago (2017-05-17 23:11:04 UTC) #128

Thank you, Peter and Nick. 

> LGTM, I trust you to follow up appropriately on outstanding issues

Thank you for trusting me :-)

Would you mind if I land this CL as it is now (with nits taken care of) and
make a follow-up CL to address those issues?  I already have a CL that does #1
and partly #2, but have been swamped with other issues.

Because a few IDN-related bugs would be resolved with this CL, I'd like to get
this one landed and work on the follow-up.

1) use a 1-step instead of 2-step process to generate the *gperf file; this can
be done using one of the methods in r-c-d
2) have a separate gperf file for unit test
3) avoid hard-coding the max number of labels to test for match; r-c-d can help
with this too, except that I need a method in r-c-d that would treat an input
domain name as canonicalized (although it isn't, because it contains an IDN)
without checking. This would be used to get eTLD+1 (where the eTLD is all
ASCII). There's already one in r-c-d, but in a debug build it'd barf that the
input is not canonicalized when I pass an IDN.

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
File components/url_formatter/idn_spoof_checker.h (right):

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
components/url_formatter/idn_spoof_checker.h:57: // Returns true if all the
Cyrillic letters in |label| belong to a set of
On 2017/05/15 18:55:31, Peter Kasting wrote:
> Nit: I would put a blank line above this comment and the next one, just to
> make "comment + function" easier for the eye to parse visually.

Done.

https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt...
components/url_formatter/idn_spoof_checker.h:61: // successfully and stored in
|skeleton|.
On 2017/05/15 18:55:31, Peter Kasting wrote:
> Nit: This comment sounds like the function just checks the skeleton rather
> than computing it?  Maybe "Stores the confusability skeleton for |hostname| in
> |skeleton|.  Returns whether the computation was successful."?
> 
> All that said, I don't actually see this method defined anywhere in the
> codebase.  Maybe it doesn't really exist?

Oops. Yeah, it was removed a few PS's ago. Thank you for the catch.

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic ...

3 years, 7 months ago (2017-05-17 23:25:01 UTC) #130

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
==========
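
As a rough, hedged sketch of the pipeline this description outlines (strip
diacritics, apply the extra mappings, compute a skeleton, look it up in the
top-domain skeletons): the table below is a toy stand-in, NOT the ICU
transliterator or confusables data the CL relies on. ToySkeleton and
LooksLikeTopDomain are hypothetical names, and only the handful of mappings
quoted in this thread are included.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Toy replacement table (an assumption for illustration, not ICU data).
std::u32string ToySkeleton(const std::u32string& host) {
  static const std::unordered_map<char32_t, std::u32string> kReplacements = {
      {U'\u00E9', U"e"},  // e-acute > e, standing in for diacritic removal
      {U'\u0142', U"l"},  // l with stroke > l
      {U'\u00F8', U"o"},  // o with stroke > o
      {U'\u0111', U"d"},  // d with stroke > d
      {U'\u043A', U"k"},  // Cyrillic ka > k
      {U'\u0138', U"k"},  // Latin kra > k
      {U'\u03BA', U"k"},  // Greek kappa > k
      {U'\u043F', U"n"},  // Cyrillic pe > n
      {U'm', U"rn"},      // m > rn, as in 'com' > 'corn' from the review
  };
  std::u32string skeleton;
  for (char32_t c : host) {
    auto it = kReplacements.find(c);
    if (it != kReplacements.end())
      skeleton += it->second;
    else
      skeleton += c;
  }
  return skeleton;
}

// Returns true if the skeleton of |host| is among the precomputed skeletons
// of top domains.
bool LooksLikeTopDomain(
    const std::u32string& host,
    const std::unordered_set<std::u32string>& top_skeletons) {
  assert(!host.empty());
  return top_skeletons.count(ToySkeleton(host)) > 0;
}
```

The set would be filled once with the skeletons of the top domains, so a
lookalike of a listed domain collapses to the same key as the real name.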

jungshik at Google

The CQ bit was checked by jshin@chromium.org

3 years, 7 months ago (2017-05-19 05:29:38 UTC) #131

jungshik at Google

The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, pkasting@chromium.org Link to ...

3 years, 7 months ago (2017-05-19 05:29:40 UTC) #132

commit-bot: I haz the power

CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2784933002/850001

3 years, 7 months ago (2017-05-19 05:30:23 UTC) #133

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 850001, "attempt_start_ts": 1495171778009680, "parent_rev": "375fc7a1d4e279c675ce23f239604f8aa80fef53", "commit_rev": "a8add0308ba6067eb3de5a8fe82f9c2f2460ad91"}

3 years, 7 months ago (2017-05-19 06:49:19 UTC) #134

commit-bot: I haz the power

Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic ...

3 years, 7 months ago (2017-05-19 06:49:33 UTC) #135

Message was sent while issue was closed.

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

commit-bot: I haz the power

Committed patchset #47 (id:850001) as https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91

3 years, 7 months ago (2017-05-19 06:49:36 UTC) #136

tsergeant

A revert of this CL (patchset #47 id:850001) has been created in https://codereview.chromium.org/2889303003/ by tsergeant@chromium.org. ...

3 years, 7 months ago (2017-05-19 07:23:40 UTC) #137

findit-for-me

Findit (https://goo.gl/kROfz5) confirmed this CL at revision 473109 as the culprit for failures in the ...

3 years, 7 months ago (2017-05-19 07:29:55 UTC) #138

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic ...

3 years, 7 months ago (2017-05-22 01:09:41 UTC) #139

Message was sent while issue was closed.

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic ...

3 years, 7 months ago (2017-05-22 01:15:46 UTC) #140

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2µs longer on average (3.3µs => 5.5µs)
on my machine (per a test run over ~1 million IDNs in the com TLD).

It adds about 1500 domains to the list of domains to display in Punycode out
of ~1 million IDNs in the com TLD (3018 => 4571).

In addition, disallow combining diacritic marks unless they're preceded by
Latin-Greek-Cyrillic.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win_clang_x64_rel,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

jungshik at Google

Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic ...

3 years, 7 months ago (2017-05-22 05:15:36 UTC) #141

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k,  п > n) to supplement the Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win_clang_x64_rel,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

jungshik at Google

3 years, 7 months ago (2017-05-22 05:16:24 UTC) #142

Description was changed from

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Relanding after revert with a compile fix for Win x64.

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

jungshik at Google

3 years, 7 months ago (2017-05-22 05:21:10 UTC) #143

Description was changed from

==========
Add checks against spoofing attempt at top domains

Relanding after revert with a compile fix for Win x64.

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Relanding after revert with a compile fix for Win x64.

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

jungshik at Google

3 years, 7 months ago (2017-05-22 05:21:33 UTC) #144

Message was sent while issue was closed.

Description was changed from

==========
Add checks against spoofing attempt at top domains

Relanding after revert with a compile fix for Win x64.

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========

to

==========
Add checks against spoofing attempt at top domains

Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.

Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement Unicode's
confusables list.

Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.

The IDN display policy check takes ~2 µs longer on average (3.3 µs => 5.5 µs)
on my machine, per a test run over ~1 million IDNs in the com TLD.

It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)

In addition, disallow combining diacritic marks unless they're preceded by
a Latin, Greek, or Cyrillic character.

BUG=703750,714628,719199,722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng

Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed:
https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
==========
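
The combining-mark rule at the end of the description can be illustrated with a rough sketch. This is an approximation, not the check in the CL: it is written in Python, the function name has_disallowed_combining_mark is hypothetical, "combining diacritic marks" is assumed to mean the Combining Diacritical Marks block (U+0300 to U+036F), the script of the preceding character is guessed from its Unicode character name because Python's standard library has no script property, and only the immediately preceding character is inspected.

```python
import unicodedata

# Assumption: a character counts as Latin, Greek, or Cyrillic if its Unicode
# name starts with one of these words (a stand-in for a real script lookup).
ALLOWED_BASE_PREFIXES = ("LATIN", "GREEK", "CYRILLIC")


def is_combining_diacritic(ch: str) -> bool:
    """Assumed definition: code points in the Combining Diacritical Marks block."""
    return 0x0300 <= ord(ch) <= 0x036F


def has_disallowed_combining_mark(label: str) -> bool:
    """Return True if a combining diacritic follows a non-LGC character."""
    previous = None
    for ch in label:
        if is_combining_diacritic(ch):
            # A mark at the start of the label has no base character at all.
            if previous is None:
                return True
            name = unicodedata.name(previous, "")
            if not name.startswith(ALLOWED_BASE_PREFIXES):
                return True
        previous = ch
    return False
```

With these assumptions, an acute accent after Latin "a" is accepted, while the same accent after a Katakana letter or a digit is rejected. A mark that follows another mark is also rejected here, which may be stricter than the actual policy.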

jungshik at Google

3 years, 7 months ago (2017-05-22 05:22:43 UTC) #145

Message was sent while issue was closed.

Instead of reopening the committed CL, I decided to make a new CL that addresses
the 'size_t <-> int' conversion issue with checked_cast<> for win_x64.

https://codereview.chromium.org/2897873002

Issue 2784933002: Mitigate spoofing attempt using Latin letters. (Closed)