|
|
Created:
3 years, 8 months ago by jungshik at Google Modified:
3 years, 7 months ago CC:
chromium-reviews, emilyschechter Target Ref:
refs/heads/master Project:
chromium Visibility:
Public. |
DescriptionAdd checks against spoofing attempt at top domains
Remove diacritic marks from a hostname and calculate the confusability
skeleton of the accent-free name. Look it up in the pre-calculated list of
the skeletons of top 10k domains.
Removing diacritic marks from a hostname is equivalent to comparing names with
the primary collation strength in the root locale. To make them equivalent,
three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal.
Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's
confusables list.
Binary file size increase: ~ 59kB for the DAFSA representation of top
domain name skeletons.
The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs)
on my machine per the test run over ~1 million IDNs in com TLD).
It adds about 1500 domains to the list of domains to display in Punycode out
of ~ 1 million IDNs in com TLD. (3018 => 4571)
In addition, disallow combining diarctic marks unless they're preceded by
Latin-Greek-Cyrillic.
BUG=703750, 714628, 719199, 722639
TEST=components_unittests --gtest_filter=*IDNToUni*
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng
Review-Url: https://codereview.chromium.org/2784933002
Cr-Commit-Position: refs/heads/master@{#473109}
Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f9c2f2460ad91
Patch Set 1 #Patch Set 2 : Use skeletons and sortkeys of top domains for check #Patch Set 3 : Add top500 domain skeletons/sortkeys #Patch Set 4 : Add make_topdomain_list generator #Patch Set 5 : fix format #Patch Set 6 : Add license headers to new files #Patch Set 7 : Remove the unused code and add a comment to SimilarToTopDomains() #Patch Set 8 : fix Windows build error and add test entries #Patch Set 9 : a bit more cleanup #Patch Set 10 : use top 10k list instead of 500 #Patch Set 11 : switch to DAFSA wip; dafsa unittests disabled to work around dep. check tempoarily #Patch Set 12 : Move dafsa tools/test data to base from net #Patch Set 13 : generate top domain skeleton/name list with make_dafsa.py #Patch Set 14 : add gperf files #Patch Set 15 : fix typos in BUIILD.gn #Patch Set 16 : fix another build issue - make_dafsa.py path fix #Patch Set 17 : Update make_top_domains_list to automatically generate gperf files #Patch Set 18 : Combine two DAFSAs into one (40kB size reduction and speed up) #Patch Set 19 : Make top_domains subdir and clean up unused files #Patch Set 20 : Add comments about the values in the gperf file #Patch Set 21 : Drop domains with more than 3 labels from the top domain list. #Patch Set 22 : Use UnicodeString::findAndReplace for 3 manual mapping #Patch Set 23 : another variation in UnicodeString ctor for findAndReplace #Patch Set 24 : use string16 ReplaceChars #Patch Set 25 : use string16 one-char replace #Patch Set 26 : Use RuleBasedTransliterator to handle "ł > l; ø > o; đ > d" #Patch Set 27 : Call transliterators only for domain names made of Latin. #Patch Set 28 : Add [\u0300-\u0331] to the set of characters for accent-free comparison #Patch Set 29 : add similarity check unittests #
Total comments: 12
Patch Set 30 : Move DAFSA-related files back to net/ from base/ #Patch Set 31 : more fixes: moving dafsa bak to net/ #Patch Set 32 : StringPiece16 for RemoveAcc, GetSkel.. #Patch Set 33 : Use 1 for domain names, 2 for skeleton, 3 for both #Patch Set 34 : use python to generate top domain list from json; drop 10k from gperf and *inc.cc names #Patch Set 35 : check in alex_names_and_skeletons.gperf #
Total comments: 6
Patch Set 36 : address ncarter's comments #Patch Set 37 : Remove diacritics and calculate skeleton before look-up #Patch Set 38 : Drop 'U+04CF (ӏ) -> l', add '[кĸκ] > k; п > n'. Fix accent check #Patch Set 39 : add a few popular domains #Patch Set 40 : Truncate '.' at the end of a hostname before similarity check #Patch Set 41 : add back U+04CF (ӏ) -> 'l' map #
Total comments: 77
Patch Set 42 : pull IDNSpoofChecker to separae h/cc files #
Total comments: 40
Patch Set 43 : rebased against refactoring CL #Patch Set 44 : rebased wrt master #Patch Set 45 : Delete top.list (accidentally added) #Patch Set 46 : Delete two more accidentally added files #
Total comments: 4
Patch Set 47 : Peter's latest comments addressed #Patch Set 48 : use checked_cast and make win64 happy #Messages
Total messages: 145 (96 generated)
Description was changed from ========== Sketch of detecting potential spoofing domains within a script BUG= ========== to ========== Sketch of detecting potential spoofing domains within a script BUG=703750 ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: android_cronet on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_cron...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...) mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
Description was changed from ========== Sketch of detecting potential spoofing domains within a script BUG=703750 ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusable skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. BUG=703750 ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: android_cronet on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_cron...) cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linu...) ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linu...) ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusable skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. BUG=703750 ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 3us longer (3.3 us => 6.7us on my machine per the test run over ~1 million IDNs in com TLD). BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 3us longer (3.3 us => 6.7us on my machine per the test run over ~1 million IDNs in com TLD). BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 3us longer (3.3 us => 6.7us on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1400 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4473) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 3us longer (3.3 us => 6.7us on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1400 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4473) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1400 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4473) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top N domains. 2. Remove diacritic marks from a hostname and compare it against the list of top N domains. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1400 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4473) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) To use DAFSA-related codes and tools from components/, they're moved from net/ to base/. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
jshin@chromium.org changed reviewers: + rsleevi@chromium.org
Ryan, can you take a look? A lot of files outside components/url_format are due to moving DAFSA-related files from net/ to base/. The actual code changes are all in omponents/url_format. I'll ask Peter later (once he's back) for an owner's review for components/url_formatter.
jshin@chromium.org changed reviewers: + pkasting@chromium.org
Adding Peter to the reviewers list now. Peter, please take a look once you're back. (NOT now :-) ).
rsleevi@chromium.org changed reviewers: + nick@chromium.org, thakis@chromium.org
Going to add two other reviewers, since I think their feedback is important enough to affect design :) Nick's been working on performance improvements for the DAFSA (yay!), and I wanted to make sure visibility into this change (since there were some CLs being worked on) is there. Nico as a base/ OWNER who Has Feelings about adding things to base. I've historically been pretty shy about it, but //net can't depend on //components (although //components can depend on net), so I'm not sure whether we should do it //net/base with a //components dependency on //net or to uplift it to //base. The DAFSA code is still very much linked to the concepts of the public suffix list, and one of the things we'd been talking about is multi-wildcards and the like, so I'm not sure where the line is between reusing vs reimplementing, given the "separate, but similar" use cases at play here. (I also haven't thought enough/reviewed enough to know whether there are more substantive feedback marks, but I did want to get some early eyes so its on the radar of folks)
And for context: If this goal is solely for .com, we could always download the .com zonefile ( https://www.verisign.com/en_US/channel-resources/domain-registry-products/zon... ) and compute skeletons for collisions, and implement a blacklist-based approach via SafeBrowsing(-like) signals on confusibility. This could reduce the data used in the binary, but be reactive blacklisting.
On 2017/04/20 20:04:50, Ryan Sleevi wrote: > Going to add two other reviewers, since I think their feedback is important > enough to affect design :) > > Nick's been working on performance improvements for the DAFSA (yay!), and I > wanted to make sure visibility into this change (since there were some CLs being > worked on) is there. > > Nico as a base/ OWNER who Has Feelings about adding things to base. I've > historically been pretty shy about it, but //net can't depend on //components > (although //components can depend on net), so I'm not sure whether we should do > it //net/base with a //components dependency on //net or to uplift it to //base. > > The DAFSA code is still very much linked to the concepts of the public suffix > list, and one of the things we'd been talking about is multi-wildcards and the > like, so I'm not sure where the line is between reusing vs reimplementing, given > the "separate, but similar" use cases at play here. > > (I also haven't thought enough/reviewed enough to know whether there are more > substantive feedback marks, but I did want to get some early eyes so its on the > radar of folks) Yup, base looks like a strange home for this to me.
Thanks, Ryan ! I didn't realize that components/ can depend on net/ (in retrospect, it's very obvious and I feel stupid not having realized it). In that case, perhaps DAFSA-related tools/codes can stay in net/. Anyway, I was about to add Nico to the reviewer list. Thank you for beating me to that :-).
On 2017/04/20 20:12:00, Nico wrote: > On 2017/04/20 20:04:50, Ryan Sleevi wrote: > > Going to add two other reviewers, since I think their feedback is important > > enough to affect design :) > > > > Nick's been working on performance improvements for the DAFSA (yay!), and I > > wanted to make sure visibility into this change (since there were some CLs > being > > worked on) is there. Thank you for adding Nick and head-up on his changes. > > Nico as a base/ OWNER who Has Feelings about adding things to base. I've > > historically been pretty shy about it, but //net can't depend on //components > > (although //components can depend on net), so I'm not sure whether we should > do > > it //net/base with a //components dependency on //net or to uplift it to > //base. > Yup, base looks like a strange home for this to me. Now that I realized that components/ can depend on net/ (which I should have from the beginning) and Nico thinks it's strange to put DAFSA stuffs in base/. I'll move them back net/. :-)
On 2017/04/20 20:07:12, Ryan Sleevi wrote: > And for context: If this goal is solely for .com, we could always download the > .com zonefile ( > https://www.verisign.com/en_US/channel-resources/domain-registry-products/zon... > ) and compute skeletons for collisions, and implement a blacklist-based approach > via SafeBrowsing(-like) signals on confusibility. This could reduce the data > used in the binary, but be reactive blacklisting. Thank you for the idea. A couple of thoughts: 1. I always wanted SafeBrowsing to combine IDN-name-similarity with their other signals and alert Chrome because it has a lot more information. It's really hard to take a right balance on the client-side with just names. The downside of that is it's reactive and cannot deal with a "targetted" attack. 2. I was using 'com' domain list as a benchmark, but '.net', '.org' and a few others need a similar 'protection'. Having said that, I realized that I can just drop domains under ccTLDs that do not allow non-ASCII-Latin names (e.g. .kr, .in, .jp). How much saving in memory footprint I'll get is not yet known.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/04/20 20:37:29, jungshik at Google wrote: > On 2017/04/20 20:07:12, Ryan Sleevi wrote: > > And for context: If this goal is solely for .com, we could always download the > > .com zonefile ( > > > https://www.verisign.com/en_US/channel-resources/domain-registry-products/zon... > > ) and compute skeletons for collisions, and implement a blacklist-based > approach > > via SafeBrowsing(-like) signals on confusibility. This could reduce the data > > used in the binary, but be reactive blacklisting. > > Thank you for the idea. > > A couple of thoughts: > > 1. I always wanted SafeBrowsing to combine IDN-name-similarity with their other > signals and alert Chrome because it has a lot more information. > It's really hard to take a right balance on the client-side with just names. The > downside of that is it's reactive and cannot deal with > a "targetted" attack. > > 2. I was using 'com' domain list as a benchmark, but '.net', '.org' and a few > others need a similar 'protection'. Having said that, I realized that I can just > drop domains > under ccTLDs that do not allow non-ASCII-Latin names (e.g. .kr, .in, .jp). How > much saving in memory footprint I'll get is not yet known. So FYI the other big DAFSA changes I have in progress is here: https://codereview.chromium.org/2649033004/ I can easily re-base, and re-test that after this lands: it's a 2x speedup. Basically I got sideswiped by perf review season, and haven't gotten back to it -- maybe next week.
https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/README:5: src/tools/perf/page_sets/alexa1-10000-urls.json by running the following: IIRC the alexa10000 from page_sets was almost five years old. If we're relying on this for user protection, we'll need to set up a process to update the list. https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/README:10: END {printf ("# for testing\ndigklmo68.com\ndigklmo68.co.uk\n");}' > \ This would probably be better as a python script, since sed is not part of our developer toolchain. https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:83: microsoft.com, 1 Here we include "microsoft.com" and "rnicrosoft.corn" but not "rnicrosoft.com". Is that a problem? https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:84: delta-search.corn, 0 Do we need to defend against confusables in the ETLD part of a hostname (afaik you can't register a .corn website currently)?
https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/url_formatter.cc:557: if (GetSkeleton(hostname, &skeleton) && LookupMatchInTopDomains(skeleton, 0)) If it's possible to efficiently obtain the skeleton (or do the diacritical removal) one input character at a time, then it might be worthwhile to use FixedSetIncrementalLookup here: we'd probably be done after one or two characters most of the time, since the first multibyte UTF-8 character would exhaust the DAFSA search. https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/url_formatter.cc:562: LookupMatchInTopDomains(accent_free_name, 1); Should we (could we) do GetSkeleton() on the RemoveDiacritics string? (Is rñicrosoft.com a concern)? If we did, then we might be able to reduce DAFSA size (only needing to store the skeletons) -- a smaller DAFSA is a faster DAFSA. Also, as above, it's worth asking if FixedSetIncrementalLookup could provide a speedup here.
https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:83: microsoft.com, 1 On 2017/04/20 22:26:59, ncarter wrote: > Here we include "microsoft.com" and "rnicrosoft.corn" but not "rnicrosoft.com". > Is that a problem? No. Entries with value '0' are skeletons while those with value '1' are original names. We calculate the skeleton of an incoming domain name and look it up in the skeleton list. We remove diacritics from an incoming domain name and look it up in the name list. The skeleton of 'com' is 'corn'. So, the skeleton of 'microsoft.com' is 'rnicrosoft.corn'. https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:84: delta-search.corn, 0 On 2017/04/20 22:26:59, ncarter wrote: > Do we need to defend against confusables in the ETLD part of a hostname (afaik > you can't register a .corn website currently)? Again, entries with '0' is a skeleton. The skeleton of 'com' is 'corn'. Nonetheless, you do have a point about the ETLD part. So, what can be done is 1) truncate the ETLD part when generating DAFSA 2) At run-time, do the same for incoming host names BTW, it'll be interesting to see how much size saving we can get from truncating ETLD part (given the DAFSA works).
https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/url_formatter.cc:557: if (GetSkeleton(hostname, &skeleton) && LookupMatchInTopDomains(skeleton, 0)) On 2017/04/20 23:37:22, ncarter wrote: > If it's possible to efficiently obtain the skeleton (or do the diacritical > removal) one input character at a time, then it might be worthwhile to use > FixedSetIncrementalLookup here: we'd probably be done after one or two > characters most of the time, since the first multibyte UTF-8 character would > exhaust the DAFSA search. Yeah, I saw that API and thought about trying one character at a time, but didn't try. AccentRemoval is rather expensive. Before switching over to DAFSA, I used sortkey with primary strength - meaning accents are ignored when generating sortkey. It's faster to calculate sortkey, but sortkey uses almost all the bytes ( 0x00 ~ 0xFF) so that DAFSA cannot be used for sortkey look-up. OTOH, not being able to use DAFSA means that I have to initialize hash table of sortkeys at runtime. Moreover, I found about dozens of domains registered in '.com' with 'base Latin letter + combining mark' (there's no precomposed form for those sequences so that normalization wouldn't get rid of them). One character at a time accent-removal would not work for them. As for the skeleton calculation, it should be (a lot) cheaper than accent-removal. again 'combining marks' above are problematic for one character at a time calculation. Perhaps, I can branch depending on whether there's a combing mark in the input. If there is, use the current approach. Otherwise, take a fast path for skeleton lookup. https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/url_formatter.cc:562: LookupMatchInTopDomains(accent_free_name, 1); On 2017/04/20 23:37:22, ncarter wrote: > Should we (could we) do GetSkeleton() on the RemoveDiacritics string? (Is > http://xn--ricrosoft-l6a.com a concern)? That's a good one !! Thank you for the suggestion. I haven't thought about it. (and ..... ick ...) In terms of performance, though, we might lose more ground by accent-removal first (accent-removal is slow.) than gaining thanks to a faster DAFSA lookup. > If we did, then we might be able to reduce DAFSA size (only needing to store the > skeletons) -- a smaller DAFSA is a faster DAFSA. Related to that is how much saving we'd get if we can have a'value-free' DAFSA (if we just store skeletons, the only thing I want is a membership. I don't care about values). > Also, as above, it's worth asking if FixedSetIncrementalLookup could provide a > speedup here. Yeah.. BTW, before trying to optimize further, I'd rather have a sense of how critical speed up is for URL formatting in omnibox and elsewhere.
On 2017/04/21 20:16:33, jungshik at Google wrote: > > Also, as above, it's worth asking if FixedSetIncrementalLookup could provide a > > speedup here. > > Yeah.. > > BTW, before trying to optimize further, I'd rather have a sense of how critical > speed up is for URL formatting in omnibox and elsewhere. That's a good question, and I'm not sure I have a good answer. Somewhat relatedly, however, https://github.com/whatwg/html/issues/2568 came up and across my GitHub monitoring. In this case, Mozilla's right now using their URL formatting (for display) algorithm for .origin events. It sounds like Boris is willing to get it off that (and onto an IDNA algorithm), but there's been a request in https://github.com/whatwg/url/issues/274 to expose "browser-compatible" display algorithms as APIs. So... somewhere in the middle? :) It's not something we're doing now, but we know people (reasonably) want :)
https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf (right): https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:10: // 1: original domain name. Should this file be generated at build time? I'm wondering what happens when we update ICU -- the ICU docs caution against caching skeletons. For that matter, does chrome always use a bundled copy of ICU, or does it ever use a system copy? https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:867: google.no, 1 Cases where a string is listed twice, like google.no, probably don't actually work properly. LookupStringInFixedSet can't return both 0 and 1 as the lookup result, even if the DAFSA encodes multiple results. Instead, you should probably treat the two cases as a bitfield: 1 means skeleton, 2 means entry, and 3 means both, like google.no. This is how the effective TLD dafsa does it. We should probably have make_dafsa.py's parse_gperf function complain about duplicate keys. It may actually be emitting a dafsa that encodes both result values, but our decoder isn't set up to look at them. [ALSO: sleevi -- this would have been caught by the "decoded dafsa should be equal to the input gperf" unittest that I wanted to add!]
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/04/21 21:35:34, ncarter wrote: > https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... > File components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf > (right): > > https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... > components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:10: // > 1: original domain name. > Should this file be generated at build time? I'm wondering what happens when we > update ICU -- the ICU docs caution against caching skeletons. > > For that matter, does chrome always use a bundled copy of ICU, or does it ever > use a system copy? Google Chrome always uses a bundled ICU, but Chromium can use a system ICU. And, you're right that the skeleton can change with an ICU version even though the skeleton of ASCII letters are not likely to change (all the alexa top 10k are ASCII only except for one which I filtered out). Nonetheless, generating a gperf at runtime seems to be a good idea. I'll see how to tweak BUILD.gn for that. > https://codereview.chromium.org/2784933002/diff/490001/components/url_formatt... > components/url_formatter/top_domains/alexa_10k_names_and_skeletons.gperf:867: > google.no, 1 > Cases where a string is listed twice, like google.no, probably don't actually > work properly. LookupStringInFixedSet can't return both 0 and 1 as the lookup > result, even if the DAFSA encodes multiple results. > > Instead, you should probably treat the two cases as a bitfield: 1 means > skeleton, 2 means entry, and 3 means both, like google.no. This is how the > effective TLD dafsa does it. Thanks a lot for catching it. I'll fix that. > > We should probably have make_dafsa.py's parse_gperf function complain about > duplicate keys. It may actually be emitting a dafsa that encodes both result > values, but our decoder isn't set up to look at them. > > [ALSO: sleevi -- this would have been caught by the "decoded dafsa should be > equal to the input gperf" unittest that I wanted to add!]
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
There are a lot of reviewers on this, most of whom seem more capable than me. Do you still want my review? If yes, what specific aspects of this change do you want thoughts on?
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~88kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) To use DAFSA-related codes and tools from components/, they're moved from net/ to base/. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~ 83kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) To use DAFSA-related codes and tools from components/, they're moved from net/ to base/. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
jshin@chromium.org changed reviewers: - thakis@chromium.org
Peter, I need your input on performance. On average (as tested with ~ 1 million IDNs with .com), this check increases the time to determine whether or not to display in Unicode by 7~80%. I believe URL formatting is not in a perf-critical path, but I'm not sure of its interaction with what's done in omnibox. BTW, what I haven't done yet is to see how memory (the size of DAFSA), execution time and # of IDNs flagged change with N (where N is # of top domains to include for checking against).
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/04/25 05:28:11, jungshik at Google wrote: > Peter, I need your input on performance. On average (as tested with ~ 1 million > IDNs with .com), this check increases the time to determine whether or not to > display in Unicode by 7~80%. I believe URL formatting is not in a perf-critical > path, but I'm not sure of its interaction with what's done in omnibox. I _think_ this isn't in a perf-critical path. It's in some codepaths that look at first like they're perf-critical for the omnibox, but on further digging I think we're likely OK. However, if you land this, I would try and coordinate with mpearson on which histograms to watch to catch perf regressions in the field.
On 2017/04/25 22:04:02, Peter Kasting wrote: > On 2017/04/25 05:28:11, jungshik at Google wrote: > > Peter, I need your input on performance. On average (as tested with ~ 1 > million > > IDNs with .com), this check increases the time to determine whether or not to > > display in Unicode by 7~80%. I believe URL formatting is not in a > perf-critical > > path, but I'm not sure of its interaction with what's done in omnibox. > > I _think_ this isn't in a perf-critical path. It's in some codepaths that look > at first like they're perf-critical for the omnibox, but on further digging I > think we're likely OK. However, if you land this, I would try and coordinate > with mpearson on which histograms to watch to catch perf regressions in the > field. Thanks a lot, Peter ! I collected some data to see how the number of IDNs flagged (out of 1 million com IDNs), average runtime (for 1 million IDNs) and the size of DAFSA change with # of top domains included in the checklist. They're available at https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR... The check time is almost flat (does not change with # of top domains in the DAFSA). Apparently, it's dominated by accent-removal (and skeleton calculation; I suspect accent-removal takes up most of time). Given what Peter wrote (not a perf-critical path) and the above finding, the only thing remaining to consider is the memory impact. If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's ~27kB. Given the slope of the memory vs # of flagged domains, I'm tempted to cut off at 3,000 (27kB memory / binary size impact). I wonder what others think of that cut-off.
On 2017/04/26 00:30:50, jungshik at Google wrote: > On 2017/04/25 22:04:02, Peter Kasting wrote: > > On 2017/04/25 05:28:11, jungshik at Google wrote: > > > Peter, I need your input on performance. On average (as tested with ~ 1 > > million > > > IDNs with .com), this check increases the time to determine whether or not > to > > > display in Unicode by 7~80%. I believe URL formatting is not in a > > perf-critical > > > path, but I'm not sure of its interaction with what's done in omnibox. > > > > I _think_ this isn't in a perf-critical path. It's in some codepaths that > look > > at first like they're perf-critical for the omnibox, but on further digging I > > think we're likely OK. However, if you land this, I would try and coordinate > > with mpearson on which histograms to watch to catch perf regressions in the > > field. > > Thanks a lot, Peter ! > > I collected some data to see how the number of IDNs flagged (out of 1 million > com IDNs), average runtime (for 1 million > IDNs) and the size of DAFSA change with # of top domains included in the > checklist. They're available at > > > https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR... > > The check time is almost flat (does not change with # of top domains in the > DAFSA). Apparently, it's dominated by accent-removal (and > skeleton calculation; I suspect accent-removal takes up most of time). > > Given what Peter wrote (not a perf-critical path) and the above finding, the > only thing remaining to consider is the memory impact. > > If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's > ~27kB. Given the slope of the memory vs # of flagged domains, > I'm tempted to cut off at 3,000 (27kB memory / binary size impact). > > I wonder what others think of that cut-off. Because 'reject' is the common result of the DAFSA lookup, it's likely only going to explore the first few nodes. In fact, if the first character of the query string is a multibyte UTF-8 character, we'll never even dereference a single byte of the DAFSA at all. I expect that to be true for most IDNs -- maybe less so for cyrillic since it has a lot of ascii-confusable characters. If that's right, I would expect the DAFSA part of the performance to be near constant as the dafsa grows in size. As you add bigrams, the consecutive offsets of the first node will get more spaced out in memory so scanning the node will eventually hit more and more cache lines. Our current DAFSA layout is optimized for space, at some cost to scanning efficiency. For the skeleton analysis, I wonder: would we gain anything by excluding domains that provably have no single-script skeleton equivalent? I'm not saying I know how to compute this, but I believe it is be computable. For example, http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008 suggests that lowercase k doesn't have any confusables.
the dafsa usage seems correct, so lgtm from that perspective https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:214: // similar to one of top N domains (N=500). Two checks are done: The (N=500) comment seems likely to fall out of date. Might be better to just provide a reference to the input file? https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:219: // top N domains. Should this document what happens if |hostname| is, without modification, one of the top N domains? https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:526: base::SPLIT_WANT_ALL); Is it possible for hostname to end in ".", or has that already been stripped? If the trailing dot is preserved to this point, then that seems like a way to defeat the spoof detection.
Thank you for the explanation and suggestion. > On 2017/04/26 00:30:50, jungshik at Google wrote: > https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR... > > > > The check time is almost flat (does not change with # of top domains in the > > DAFSA). Apparently, it's dominated by accent-removal (and > > skeleton calculation; I suspect accent-removal takes up most of time). > > > > Given what Peter wrote (not a perf-critical path) and the above finding, the > > only thing remaining to consider is the memory impact. > > > > If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's > > ~27kB. Given the slope of the memory vs # of flagged domains, > > I'm tempted to cut off at 3,000 (27kB memory / binary size impact). > > > > I wonder what others think of that cut-off. > > Because 'reject' is the common result of the DAFSA lookup, it's likely only > going to explore the first few nodes. In fact, if the first character of the > query string is a multibyte UTF-8 character, we'll never even dereference a > single byte of the DAFSA at all. I expect that to be true for most IDNs -- maybe > less so for cyrillic since it has a lot of ascii-confusable characters. > If that's right, I would expect the DAFSA part of the performance to be near > constant as the dafsa grows in size. As you add bigrams, the consecutive offsets > of the first node will get more spaced out in memory so scanning the node will > eventually hit more and more cache lines. Our current DAFSA layout is optimized > for space, at some cost to scanning efficiency. > > For the skeleton analysis, I wonder: would we gain anything by excluding domains > that provably have no single-script skeleton equivalent? I'm not saying I know > how to compute this, but I believe it is be computable. For example, > http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008 > suggests that lowercase k doesn't have any confusables. You want to try it for the space efficiency, don't you? (As we discussed, there's little worry about speed because 1) DAFSA scanning contributes very little to perf 2) the code in question is not in a perf-critical path.) To compute that, I need to scrape the Unicode confusables data [1] (there's no ICU API to get the list of confusables for a given character) and traverse over all possible confusables combinations for a given label to see if any of them is a single script (i.e. if any of them would not be flagged as mixed-script). That will be interesting, but I don't expect a drastic reduction. [1] http://www.unicode.org/Public/security/9.0.0/confusables.txt : has 736 entries for LDH in ASCII, but a lot of them are outside the allowed character set.
Thank you for LGTM. Addressed the comments in the latest PS. https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:214: // similar to one of top N domains (N=500). Two checks are done: On 2017/04/26 18:46:29, ncarter wrote: > The (N=500) comment seems likely to fall out of date. Might be better to just > provide a reference to the input file? Done. https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:219: // top N domains. On 2017/04/26 18:46:29, ncarter wrote: > Should this document what happens if |hostname| is, without modification, one of > the top N domains? Non-IDN hostnames will not reach here (they're not subject to this check.). I can document that. https://codereview.chromium.org/2784933002/diff/610001/components/url_formatt... components/url_formatter/url_formatter.cc:526: base::SPLIT_WANT_ALL); On 2017/04/26 18:46:29, ncarter wrote: > Is it possible for hostname to end in ".", or has that already been stripped? If > the trailing dot is preserved to this point, then that seems like a way to > defeat the spoof detection. That's taken care of in the loop in the caller (see lines 271 to 298 in IDNToUnicodeWithAdjustments). So, there will not be any trailing dot here.
On 2017/04/26 19:12:38, jungshik at Google wrote: > Thank you for the explanation and suggestion. > > > On 2017/04/26 00:30:50, jungshik at Google wrote: > > > > https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR... > > > > > > The check time is almost flat (does not change with # of top domains in the > > > DAFSA). Apparently, it's dominated by accent-removal (and > > > skeleton calculation; I suspect accent-removal takes up most of time). > > > > > > Given what Peter wrote (not a perf-critical path) and the above finding, the > > > only thing remaining to consider is the memory impact. > > > > > > If you use top 10k, it's 83kB. With top 500, it's ~ 4kB. With top 3000, it's > > > ~27kB. Given the slope of the memory vs # of flagged domains, > > > I'm tempted to cut off at 3,000 (27kB memory / binary size impact). > > > > > > I wonder what others think of that cut-off. > > > > Because 'reject' is the common result of the DAFSA lookup, it's likely only > > going to explore the first few nodes. In fact, if the first character of the > > query string is a multibyte UTF-8 character, we'll never even dereference a > > single byte of the DAFSA at all. I expect that to be true for most IDNs -- > maybe > > less so for cyrillic since it has a lot of ascii-confusable characters. > > > > If that's right, I would expect the DAFSA part of the performance to be near > > constant as the dafsa grows in size. As you add bigrams, the consecutive > offsets > > of the first node will get more spaced out in memory so scanning the node will > > eventually hit more and more cache lines. Our current DAFSA layout is > optimized > > for space, at some cost to scanning efficiency. > > > > For the skeleton analysis, I wonder: would we gain anything by excluding > domains > > that provably have no single-script skeleton equivalent? I'm not saying I know > > how to compute this, but I believe it is be computable. For example, > > > http://unicode.org/cldr/utility/confusables.jsp?a=abcdefghijklmnop-dz&r=IDNA2008 > > suggests that lowercase k doesn't have any confusables. > > You want to try it for the space efficiency, don't you? (As we discussed, > there's little worry about speed because 1) DAFSA scanning contributes very > little to perf 2) the code in question is not > in a perf-critical path.) > > To compute that, I need to scrape the Unicode confusables data [1] > (there's no ICU API to get the list of confusables for a given character) and > traverse over all possible confusables combinations for a given label to see if > any of > them is a single script (i.e. if any of them would not be flagged as > mixed-script). That will be interesting, but I don't expect a drastic reduction. > > > [1] http://www.unicode.org/Public/security/9.0.0/confusables.txt : has 736 > entries for LDH in ASCII, but a lot of them are outside the allowed character > set. Right, I was only concerned about space (or really, about coverage: with a given space budget, if we can exclude some useless entries, that makes room for other vulnerable sites to take their place). But if pretty much every entry is potentially spoofable via diacriticals, that's not too much DAFSA size to shed, and the idea probably isn't worth exploring. FWIW, I agree, I see no simple ICU way to explore the mapping of skeleton to single-script confusable string either. And if we did build such a tool, it would have some potentially nefarious uses.
Description was changed from ========== Add checks against spoofing attempt at top domains Two checks are added against potential spoofing attempts. 1. Calculate the confusability skeletons of a hostname and look it up in the pre-calculated list of the skeletons of top 10k domains. 2. Remove diacritic marks from a hostname and compare it against the list of top 10k domains. This is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~ 83kB for the combined DAFSA representation of top domain names and their skeletons. (Two separate DAFSA for names and skeletons takes up ~ 130kB .) The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4534) To use DAFSA-related codes and tools from components/, they're moved from net/ to base/. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4595) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4595) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings (ӏ -> l; к -> k) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4595) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
On 2017/04/26 20:20:40, ncarter wrote: > On 2017/04/26 19:12:38, jungshik at Google wrote: > > Thank you for the explanation and suggestion. > > You want to try it for the space efficiency, don't you? (As we discussed, > > there's little worry about speed because 1) DAFSA scanning contributes very > > little to perf 2) the code in question is not > > in a perf-critical path.) > Right, I was only concerned about space (or really, about coverage: with a > given space budget, if we can exclude some useless entries, that makes room > for other vulnerable sites to take their place). But if pretty much > every entry is potentially spoofable via diacriticals, that's not too much > DAFSA size to shed, and the idea probably isn't worth exploring. > > FWIW, I agree, I see no simple ICU way to explore the mapping of skeleton > to single-script confusable string either. And if we did build such a tool, > it would have some potentially nefarious uses. Thank you for the clarification. I agree that it'd not be wise to publish that, but somebody apparently did on github. Before moving forward, I wanted to make sure that using Alexa list is ok and it's taken care of. In the meantime (while waiting for the answer to the above), bug 714628 came up and made me revise this CL. Basically, I follow your earlier proposal to remove diacritics before calculating the skeleton. After that, the look-up is only done in the skeletons of top domains (instead of two separate look-ups). That cuts down the data size by 23kB for top 10k domains (83kB -> ~60kB). The performance is about 10% better (perhaps not mainly because of savings in look-up but because of some other accompanied changes, I guess). An additional benefit is that the code got simpler. https://docs.google.com/spreadsheets/d/1hGMaEIUPRDP8Io7jIgh_lpQUrM99EpUbOphmR... : PS37 tab has the information. Anyway, can you take another look? Ryan, can you also take a look? We also have to decide how many top domains to include in the list? As shown in the spreadsheet, adding more domains (increasing the DAFSA size) has diminishing return.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings (ӏ -> l; к -> k) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4595) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_tsan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
Peter, I also need your owner review/approval as well. Thank you, This is not perfect and I have some reservation about this, but I think it's good enough as an interim (currently practical) solution (especially given recent attention to this issue). ------------ additional notes: Eventually, it's likely that we have to utilize history (as Ryan suggested a while ago and was also mentioned in https://bugs.chromium.org/p/chromium/issues/detail?id=711772#c22 ; I agree to this comment in spirit.
Oops. The beginning of my reply is lost in copy'n'paste. Ryan, can you take a look ? Thanks On 2017/05/08 23:36:01, jungshik at Google wrote: > Peter, I also need your owner review/approval as well. > > Thank you, > > This is not perfect and I have some reservation about this, but I think it's > good enough as an interim (currently practical) solution (especially given > recent attention to this issue). > > ------------ > additional notes: > Eventually, it's likely that we have to utilize history (as Ryan suggested a > while ago and was also mentioned in > > https://bugs.chromium.org/p/chromium/issues/detail?id=711772#c22 ; I agree to > this comment in spirit.
rsleevi@chromium.org changed reviewers: + brettw@chromium.org, emilyschechter@chromium.org
I'm not sure I feel qualified enough to review this from a policy side, and I think Peter's probably plenty coverage enough from a technical side. I've added Brett for the GN changes (and as he's a backup OWNER), but my understanding is that your goal is to have this checked in as a single binary (like tld_cleanup) and run manually whenever the tests Alexa Top 10K changes. I added Emily to make sure it's got the visibility - from all the threads, I couldn't quite figure out who 'owned' the decision, but if anyone knows, it'll be Emily. I chatted with palmer@ to make sure it wasn't him on a technical front, and he confirmed. So: - Peter for impl - Brett for GN best practices - Emily for go/no-go I think that covers it? :) https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/BUILD.gn (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/BUILD.gn:21: executable("make_top_domain_gperf") { I would defer to brettw here whether there's any toolchain tricks needed here to make sure it generates something that will run on the host architecture during cross-compilation to generate the target dataset... Or is this just intended to be run manually? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") It seems like this should be an input parameter to the script, so that the dependency can be captured by BUILD.gn Did I miss a reply to the concerns previously raised that this list is very out of date? You may wish to check with OSPO on this, since it'll be shipped in the Chrome binary, which is different than how it's being used today. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some popular domains if they're missing. I'm not sure why this part - it seems to be more subjective?
https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/README:2: It is an input to make_top_domain_list and is made up of list of Alexa Nit: "The Alexa top 10k domains, one per line, constructed by running make_alexa_top_list.py. Used as an input to make_top_domain_gperf." I assume that "list" here and below should have been "gperf". Also, be consistent about whether or not to use a newline between the bulleted name above and the explanation here. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/README:9: It is generated by running make_top_domain_list and checked in. Nit: "The checked-in output of make_top_domain_gperf. Processed during the build to generate alexa_names_and_skeletons-inc.cc, which is used by url_formatter.cc. This must be regenerated as follows if ICU is updated, since skeletons can differ across ICU versions: <instructions>" This makes me wonder if there should be a README note in the ICU directory that says to regenerate this. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:3: # # Use of this source code is governed by a BSD-style license that can be Nit: No need for # #? (2 places) https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:6: """Generate alexa_domains.list from Nit: Generates? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_top_domain_gperf.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:44: std::cerr << "failed to write to " << path.AsUTF8Unsafe() Nit: Initial caps? (several places) https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:48: return true; Nit: Shorter: bool succeeded = base::WriteFile(path, content.data(), content.size()) != -1; if (!succeeded) std::cerr << "failed to write to " << path.AsUTF8Unsafe() << '\n'; return succeeded; ...though I wonder if "!= -1" should be a check that you actually wrote the full size expected. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:53: std::cerr << "Generate the list of top domain skeletons to use as\n" Nit: Generates Seems like this could be wrapped closer to 80 columns of output too. Something like this (which also takes into account the next two comments): std::cerr << "Generates the list of top domain skeletons to use as input to" "\nbase/dafsa/make_dafsa.py.\nUsage: " << argv[0] << '\n'; https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:54: << "input to base/dafsa/make_dafsa.py\n"; Nit: Leading << not necessary when continuing previous string constant (several places) https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:55: std::cerr << "Usage: " << argv[0] << '\n'; Nit: Why not just continue << from the previous statement? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:79: std::stringstream output; As far as I can tell, you just append unformatted strings to this. Any reason not to just use a string and += directly? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:89: << "confusability check.\n" Nit: "...for the confusability check in <sourcefile name>" (or <function_name>)? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:94: std::string domains_with_max_labels; Nit: domain, singular? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:111: Nit: No blank line https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:203: class IDNSpoofChecker { Nit: It might be nice to pull this class out to its own .h/.cc for maximum readability. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:207: // Returns true if |label| is safe to display as Unicode. When the TLD is Nit: Does the second sentence here really need to be here? It seems like it only describes a portion of the functionality of the function. Maybe we should just say "See the function body for details on the specific safety checks performed"? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:211: bool Check(base::StringPiece16 label, bool is_tld_ascii); Nit: This is a poor function name; how about something like SafeToDisplayAsUnicode()? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:224: void SetAllowedUnicodeSet(UErrorCode* status); Nit: I suggest adding comments for these even though they're private. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:245: void OnThreadTermination(void* regex_matcher) { Let me guess: the RegexMatcher uses internal state, so it's not possible to simultaneously use it from multiple threads, hence the need to stick it in TLS. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:270: bool has_idn_component = false; Can we reach this function with an input that doesn't cause the loop below to set this to true? It seems unlikely. If not, we could eliminate this variable. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:286: has_idn_component = has_idn_component || converted_idn; Nit: Or use |= https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:300: if (has_idn_component && Nit: Might want a comment above this block like "Leave as punycode any inputs that spoof top domains." https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:305: } Nit: Blank line after this? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:392: // TODO(jshin): Revisit "ł > l; ø > o" mapping. Nit: Might want to link this TODO to a bug or otherwise expand on why/how you'd revisit, or this won't be actionable by others. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:402: if (U_FAILURE(status)) Do not handle DCHECK failure; assume DCHECKs cannot fail. If they can, they should be conditionals, not DCHECKs. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:436: // - it's made entirely of Cyrillic letters that look like Latin letters. Nit: it's -> the TLD is ASCII, and the input is ? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:454: // Note that non-ASCII Latin check should not be applied when the entire label Nit: that -> that the ? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:515: const size_t kNumberOfLabelsToCheck = 3; Can we write this value into the file so we don't need to hardcode it here? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:517: bool LookupStringInSet(base::StringPiece needle, Nit: If you're not going to use boring names for your params, I'd copy the ones from the underlying net:: declaration rather than using |needle|. That said, this wrapper is so short, and is called only once, that I'd just inline the body of this at the callsite below. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:528: DCHECK(hostname[hostname.length() - 1] != '.'); Nit: hostname.back() https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:533: labels.erase(labels.begin()); Nit: Seems like a single call to vector::erase could be more efficient than a while loop. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:535: while (labels.size() > 1) { Is this naive loop faster than computing the actual eTLD+1 length using the RCDS and then doing a single DAFSA lookup? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:546: (*(hostname.rbegin()) == '.' ? 1 : 0); Nit: Use .back() instead of *rbegin() https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:552: ustr_host.length() && transliterator_) Note that if the DCHECK earlier is assumed not to fail, this null-check can disappear. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:557: uspoof_getSkeletonUnicodeString(checker_, 0, /* not used. deprecated. */ Nit: If you're going to add /* */ (which I'm not sure is necessary), do so before the comma to make it very clear which parameter this is on. "deprecated." is also probably unnecessary here. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top domain list. Can we avoid including these test domains in the real list, and instead cause the test to poke them into the list while it's running?
https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/BUILD.gn (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/BUILD.gn:21: executable("make_top_domain_gperf") { On 2017/05/09 00:00:29, Ryan Sleevi wrote: > I would defer to brettw here whether there's any toolchain tricks needed here to > make sure it generates something that will run on the host architecture during > cross-compilation to generate the target dataset... > > Or is this just intended to be run manually? Yes, that's to be run manually. That's why is_ios and is_android are excluded. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 00:00:29, Ryan Sleevi wrote: > It seems like this should be an input parameter to the script, so that the > dependency can be captured by BUILD.gn > > Did I miss a reply to the concerns previously raised that this list is very out > of date? > > You may wish to check with OSPO on this, since it'll be shipped in the Chrome > binary, which is different than how it's being used today. I checked with Chrome counsel (which is why I was 'silent' for a while on this CL). Using the list (outdated/old) already in the chromium tree was green-lighted. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some popular domains if they're missing. On 2017/05/09 00:00:29, Ryan Sleevi wrote: > I'm not sure why this part - it seems to be more subjective? Yes, it's subjective. Because the list is old, some newer domains are not in the list. And, if the cut-off is < 6000, gmail and hotmail wouldn't make the list, which is why I'm adding them manually. 360, ntd, onckds are from the publicly available Alexa top 50 list.
https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 00:00:29, Ryan Sleevi wrote: > It seems like this should be an input parameter to the script, so that the > dependency can be captured by BUILD.gn This is not run during build. It's run manually to generate alexa_domains.list. Then, alexa_dmains.list is checked in. Also, make_topdomain_gperf is run manually to generate alexa_*.gperf. That file is also checked in. (See README in the directory). During build, make_dafsa.py is run to generate alexa...-inc.cc. That last step is captured in BUILD.gn. This is very similar to the way eTLD list is handled/updated. Perhaps, it's possible to make the first two steps happen during build time, too, but I didn't (my GN foo is not strong enough and I thought it's good enough given the way eTLD list is handled).
https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 19:57:39, jungshik at Google wrote: > On 2017/05/09 00:00:29, Ryan Sleevi wrote: > > It seems like this should be an input parameter to the script, so that the > > dependency can be captured by BUILD.gn > > > > Did I miss a reply to the concerns previously raised that this list is very > out > > of date? > > > > You may wish to check with OSPO on this, since it'll be shipped in the Chrome > > binary, which is different than how it's being used today. > > I checked with Chrome counsel (which is why I was 'silent' for a while on this > CL). Using the list (outdated/old) already in the chromium tree was > green-lighted. Separately, are we going to be updating that list? It would be nice to have a more up-to-date list which obviates the need for the manual additions below. If this will be in the tree long-term, we also need a process to ensure it's regularly maintained, including a person or team who will be on the hook to maintain it.
Addressed most of Peter's comments in the latest PS https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/README:2: It is an input to make_top_domain_list and is made up of list of Alexa On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: "The Alexa top 10k domains, one per line, constructed by running > make_alexa_top_list.py. Used as an input to make_top_domain_gperf." > > I assume that "list" here and below should have been "gperf". > > Also, be consistent about whether or not to use a newline between the bulleted > name above and the explanation here. Thanks. Done https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/README:9: It is generated by running make_top_domain_list and checked in. On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: "The checked-in output of make_top_domain_gperf. Processed during the > build to generate alexa_names_and_skeletons-inc.cc, which is used by > url_formatter.cc. This must be regenerated as follows if ICU is updated, since > skeletons can differ across ICU versions: <instructions>" Thank you for a much better rewrite. > This makes me wonder if there should be a README note in the ICU directory that > says to regenerate this. I'll add it to README.chromium in third_party/icu. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:3: # # Use of this source code is governed by a BSD-style license that can be On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: No need for # #? (2 places) Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:6: """Generate alexa_domains.list from On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: Generates? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:21: "page_sets", "alexa1-10000-urls.json") On 2017/05/09 20:50:14, Peter Kasting wrote: > On 2017/05/09 19:57:39, jungshik at Google wrote: > > On 2017/05/09 00:00:29, Ryan Sleevi wrote: > > > It seems like this should be an input parameter to the script, so that the > > > dependency can be captured by BUILD.gn > > > > > > Did I miss a reply to the concerns previously raised that this list is very > > out > > > of date? > > > > > > You may wish to check with OSPO on this, since it'll be shipped in the > Chrome > > > binary, which is different than how it's being used today. > > > > I checked with Chrome counsel (which is why I was 'silent' for a while on this > > CL). Using the list (outdated/old) already in the chromium tree was > > green-lighted. > > Separately, are we going to be updating that list? It would be nice to have a > more up-to-date list which obviates the need for the manual additions below. If > this will be in the tree long-term, we also need a process to ensure it's > regularly maintained, including a person or team who will be on the hook to > maintain it. That's a weak link because up-to-date Alexa list is not public. (only the top 50 list is public). I wish there were an alternative public list. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/top_domains/make_top_domain_gperf.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:44: std::cerr << "failed to write to " << path.AsUTF8Unsafe() On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: Initial caps? (several places) Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:48: return true; On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: Shorter: > > bool succeeded = base::WriteFile(path, content.data(), content.size()) != -1; > if (!succeeded) > std::cerr << "failed to write to " << path.AsUTF8Unsafe() << '\n'; > return succeeded; > > ...though I wonder if "!= -1" should be a check that you actually wrote the full > size expected. Ok. Turned it into CHECK. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:53: std::cerr << "Generate the list of top domain skeletons to use as\n" On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Generates > > Seems like this could be wrapped closer to 80 columns of output too. Something > like this (which also takes into account the next two comments): > > std::cerr << "Generates the list of top domain skeletons to use as input to" > "\nbase/dafsa/make_dafsa.py.\nUsage: " << argv[0] << '\n'; Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:54: << "input to base/dafsa/make_dafsa.py\n"; On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: Leading << not necessary when continuing previous string constant (several > places) Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:55: std::cerr << "Usage: " << argv[0] << '\n'; On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: Why not just continue << from the previous statement? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:79: std::stringstream output; On 2017/05/09 01:37:02, Peter Kasting wrote: > As far as I can tell, you just append unformatted strings to this. Any reason > not to just use a string and += directly? You.re right. Thank you for catching it. At first, it's an ostream to a file which I later turned to stringstream without thinking much when rewriting to use base::Path, etc. Changed to std::string and "+=". https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:89: << "confusability check.\n" On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: "...for the confusability check in <sourcefile name>" (or <function_name>)? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:94: std::string domains_with_max_labels; On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: domain, singular? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:111: On 2017/05/09 01:37:02, Peter Kasting wrote: > Nit: No blank line Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:203: class IDNSpoofChecker { On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: It might be nice to pull this class out to its own .h/.cc for maximum > readability. Ok. pulled it out. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:207: // Returns true if |label| is safe to display as Unicode. When the TLD is On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Does the second sentence here really need to be here? It seems like it > only describes a portion of the functionality of the function. Maybe we should > just say "See the function body for details on the specific safety checks > performed"? Yeah, that's better. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:211: bool Check(base::StringPiece16 label, bool is_tld_ascii); On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: This is a poor function name; how about something like > SafeToDisplayAsUnicode()? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:224: void SetAllowedUnicodeSet(UErrorCode* status); On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: I suggest adding comments for these even though they're private. Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:270: bool has_idn_component = false; On 2017/05/09 01:37:03, Peter Kasting wrote: > Can we reach this function with an input that doesn't cause the loop below to > set this to true? It seems unlikely. If not, we could eliminate this variable. IDNToUnicode (which calls this function) is called with any host name out of GURL. So, has_idn_component can be false for all labels/components. Then, has_idn_component would be false, too. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:286: has_idn_component = has_idn_component || converted_idn; On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Or use |= Changed. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:300: if (has_idn_component && On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Might want a comment above this block like "Leave as punycode any inputs > that spoof top domains." Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:305: } On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Blank line after this? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:402: if (U_FAILURE(status)) On 2017/05/09 01:37:03, Peter Kasting wrote: > Do not handle DCHECK failure; assume DCHECKs cannot fail. If they can, they > should be conditionals, not DCHECKs. Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:436: // - it's made entirely of Cyrillic letters that look like Latin letters. On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: it's -> the TLD is ASCII, and the input is ? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:454: // Note that non-ASCII Latin check should not be applied when the entire label On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: that -> that the ? Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:515: const size_t kNumberOfLabelsToCheck = 3; On 2017/05/09 01:37:03, Peter Kasting wrote: > Can we write this value into the file so we don't need to hardcode it here? make_top_domain_gperf can write that out to another file (the only line other than license boilerplate would be the above line). Do you like that? https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:517: bool LookupStringInSet(base::StringPiece needle, On 2017/05/09 01:37:04, Peter Kasting wrote: > Nit: If you're not going to use boring names for your params, I'd copy the ones > from the underlying net:: declaration rather than using |needle|. > > That said, this wrapper is so short, and is called only once, that I'd just > inline the body of this at the callsite below. Inlined it. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:528: DCHECK(hostname[hostname.length() - 1] != '.'); On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: hostname.back() Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:533: labels.erase(labels.begin()); On 2017/05/09 01:37:04, Peter Kasting wrote: > Nit: Seems like a single call to vector::erase could be more efficient than a > while loop. Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:535: while (labels.size() > 1) { On 2017/05/09 01:37:03, Peter Kasting wrote: > Is this naive loop faster than computing the actual eTLD+1 length using the RCDS > and then doing a single DAFSA lookup? 'hostname' is not a good name (at one point, it's either a hostname or its skeleton, but now it's always a skeleton). Changed it to |skeleton|. eTLD match cannot be done because even 'com' is turned to 'c o r n' (without spaces). We can try the eTLD match before the skeleton calculation. Hmm, it appears that RCDS canonicalizes 'hostname' before finding eTLD+1. The canonicalization would turn an IDN to punycode That would not work here. Because # of labels is limited to 3 here, at most two look ups are done here. So, I'd expect little difference even if RCDS works without canonicalization.. OTOH, thanks to your suggestion, it occurred to me that I can change 2-step (python + C++) into one step (C++ using RCDS that accepts URLs to extract eTLD +1) https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:546: (*(hostname.rbegin()) == '.' ? 1 : 0); On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: Use .back() instead of *rbegin() Done. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:552: ustr_host.length() && transliterator_) On 2017/05/09 01:37:03, Peter Kasting wrote: > Note that if the DCHECK earlier is assumed not to fail, this null-check can > disappear. removed it. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:557: uspoof_getSkeletonUnicodeString(checker_, 0, /* not used. deprecated. */ On 2017/05/09 01:37:03, Peter Kasting wrote: > Nit: If you're going to add /* */ (which I'm not sure is necessary), do so > before the comma to make it very clear which parameter this is on. > > "deprecated." is also probably unnecessary here. ok. just removed it.
Addressed most of Peter's comments in the latest PS
On 2017/05/10 18:05:21, jungshik at Google wrote: > Addressed most of Peter's comments in the latest PS I meant PS 42.
+nparker for the following question: This proposed defense against IDN homograph spoofing requires a list of 'popular websites' to check against for visual similarity; strings that are too visually similar to domains on the list would display as punycode in the address bar. However, we don't have a great solution for generating this list and maintaining it over time -- we're currently using an alexa list that's very out of date, and which will only become more so. Does the SafeBrowsing team have any capacity to provide a data source like this, or if not, can you think of anyone who does? (e.g., aggregated urls from password sync metadata)
This is looking pretty good. https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/730001/components/url_formatt... components/url_formatter/url_formatter.cc:515: const size_t kNumberOfLabelsToCheck = 3; On 2017/05/10 18:05:13, jungshik at Google wrote: > On 2017/05/09 01:37:03, Peter Kasting wrote: > > Can we write this value into the file so we don't need to hardcode it here? > > make_top_domain_gperf can write that out to another file (the only line other > than license boilerplate would be the above line). Do you like that? Can't we write it to the same file? Having a separate file for this works, I guess, it just feels inelegant. Dunno why I'm worrying. It'd still probably be better than hardcoding this. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/idn_spoof_checker.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:29: DCHECK(skeleton.back() != '.'); Nit: DCHECK_NE('.', skeleton.back())? https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:114: // This set is used to determine whether or not to apply a slow Since you're both moving this code to this file and modifying it, it's hard to tell what the functional vs. nonfunctional diff from this change is. Consider splitting off a separate initial CL that just moves the code out to its own file without modifying it (more than is necessary), then having this CL only make functional changes. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:284: // a subset of |cyrillic_letters_latin_alike_|. Nit: The sentence above is the sort of thing I'd expect to see on the function declaration instead. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/idn_spoof_checker.h (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:5: // A helper class for IDN Spoof checking, used to ensure that no IDN input is Nit: Maybe this should be just above the class itself, since that's what it seems to describe. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:23: #include "third_party/icu/source/i18n/unicode/uspoof.h" Nit: Seems like most of these #includes are not needed in this header, but only in the .cc file. Of the ICU stuff, only UnicodeSet seems like it needs to be #included; the few other things referenced below can be forward-declared. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:48: // Sets allowed characters in IDN labels and turn on USPOOF_CHAR_LIMIT. Nit: turns https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:50: bool IsMadeOfLatinAlikeCyrillic(const icu::UnicodeString& label_string); Nit: These other two functions don't have comments (probably should) and definitely should be separated from the function above by a blank line since the comment above it doesn't also apply to them. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:8: * alexa_names_and_skeletons.gperf Nit: File has been renamed? https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:12: url_formatter.cc. This must be regenerated as follows if ICU is updated, since Nit: 80 columns https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:15: Nit: Second blank line not needed https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some popular domains if they're missing. Nit: Maybe link to a bug on getting a process in place whereby we have an updated list? https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/make_top_domain_gperf.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:26: uspoof_getSkeletonUnicodeString(spoof_checker, 0, /* not used */ Nit: Suggest ',' after */ instead of before https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:37: .Append(FILE_PATH_LITERAL("url_formatter")) Nit: Is this how git cl format indented this? I'd have expected dots aligned. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:44: bool succeeded = base::WriteFile(path, content.data(), content.size()) != -1; Again, shouldn't this check "== content.size()"? https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:45: CHECK(succeeded) << "Failed to write to " << path.AsUTF8Unsafe() << '\n'; Nit: Maybe rather than CHECK, return |succeeded|, so the caller can cerr/return nonzero. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:75: "// Copyright 2017 The Chromium Authors. All rights reserved.\n" Nit: Maybe consider a raw string literal https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/top.list (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/top.list:1: facebook.com Why do we have both this file and alexa_domains.list? https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter.cc:193: #if 0 Remove before landing https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top domain list. I still would like to see the test poke these into place rather than have them included in the compiled-in list.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Thanks for a thorough review. (as you found out, pulling out IDNSpoofChecker was done separately and the CL was landed). The latest CL is on top of that CL. A couple of Peter's comments to take care of, but most have been addressed. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/idn_spoof_checker.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:29: DCHECK(skeleton.back() != '.'); On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: DCHECK_NE('.', skeleton.back())? Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:114: // This set is used to determine whether or not to apply a slow On 2017/05/10 22:38:46, Peter Kasting wrote: > Since you're both moving this code to this file and modifying it, it's hard to > tell what the functional vs. nonfunctional diff from this change is. Consider > splitting off a separate initial CL that just moves the code out to its own file > without modifying it (more than is necessary), then having this CL only make > functional changes. Ok. A file reorg CL is https://codereview.chromium.org/2877973003 . https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.cc:284: // a subset of |cyrillic_letters_latin_alike_|. On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: The sentence above is the sort of thing I'd expect to see on the function > declaration instead. Added to the header. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/idn_spoof_checker.h (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:5: // A helper class for IDN Spoof checking, used to ensure that no IDN input is On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: Maybe this should be just above the class itself, since that's what it > seems to describe. Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:23: #include "third_party/icu/source/i18n/unicode/uspoof.h" On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: Seems like most of these #includes are not needed in this header, but only > in the .cc file. Of the ICU stuff, only UnicodeSet seems like it needs to be > #included; the few other things referenced below can be forward-declared. Actually, I tried forward declarations before falling back to include headers in .h file. namespace icu { class UnicodeString; class Transliterator; ..... ...} does not work because of the way 'icu' is defined. However, I forgot that I could use U_ICU_NAMESPACE in place of 'icu'. With that, forward declaration works. Thank you for the comment. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:48: // Sets allowed characters in IDN labels and turn on USPOOF_CHAR_LIMIT. On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: turns Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:50: bool IsMadeOfLatinAlikeCyrillic(const icu::UnicodeString& label_string); On 2017/05/10 22:38:46, Peter Kasting wrote: > Nit: These other two functions don't have comments (probably should) and > definitely should be separated from the function above by a blank line since the > comment above it doesn't also apply to them. will add desc. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/README (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:8: * alexa_names_and_skeletons.gperf On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: File has been renamed? Corrected. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:12: url_formatter.cc. This must be regenerated as follows if ICU is updated, since On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: 80 columns Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/README:15: On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Second blank line not needed Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/make_alexa_top_list.py (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_alexa_top_list.py:46: # Add some popular domains if they're missing. On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Maybe link to a bug on getting a process in place whereby we have an > updated list? https://bugs.chromium.org/p/chromium/issues/detail?id=722022 https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/make_top_domain_gperf.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:26: uspoof_getSkeletonUnicodeString(spoof_checker, 0, /* not used */ On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Suggest ',' after */ instead of before Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:37: .Append(FILE_PATH_LITERAL("url_formatter")) On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Is this how git cl format indented this? I'd have expected dots aligned. Yes, even if I get dots aligned manually, 'git cl format' put them back where they used to be. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:44: bool succeeded = base::WriteFile(path, content.data(), content.size()) != -1; On 2017/05/10 22:38:47, Peter Kasting wrote: > Again, shouldn't this check "== content.size()"? Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:45: CHECK(succeeded) << "Failed to write to " << path.AsUTF8Unsafe() << '\n'; On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Maybe rather than CHECK, return |succeeded|, so the caller can cerr/return > nonzero. Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/make_top_domain_gperf.cc:75: "// Copyright 2017 The Chromium Authors. All rights reserved.\n" On 2017/05/10 22:38:47, Peter Kasting wrote: > Nit: Maybe consider a raw string literal Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/top_domains/top.list (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/top_domains/top.list:1: facebook.com On 2017/05/10 22:38:47, Peter Kasting wrote: > Why do we have both this file and alexa_domains.list? "git commit -a" was run without checking the actual file list being added. Removed in the latest CL. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter.cc:193: #if 0 On 2017/05/10 22:38:47, Peter Kasting wrote: > Remove before landing Done. https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top domain list. On 2017/05/10 22:38:47, Peter Kasting wrote: > I still would like to see the test poke these into place rather than have them > included in the compiled-in list. I don't disagree with you, but couldn't think of a simple way to add them only for test at run-time. Perhaps, I can do something similar to what's done in net/base/registry_controlled_domains/registry_controlled_domain.h and related files.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top domain list. On 2017/05/14 09:36:23, jungshik at Google wrote: > On 2017/05/10 22:38:47, Peter Kasting wrote: > > I still would like to see the test poke these into place rather than have them > > included in the compiled-in list. > > I don't disagree with you, but couldn't think of a simple way to add them only > for test at run-time. > > Perhaps, I can do something similar to what's done in > net/base/registry_controlled_domains/registry_controlled_domain.h and related > files. The way the other unittests work, we do a separate DAFSA generation step on the test data, and there's a SetDafsaForTesting() function (SetFindDomainGraph) that lets you swap out the global DAFSA structure. Alternatively, you could add a layer of indirection that would mock out the DAFSA lookup altogether, but IMO the first approach is better: it's worthwhile for tests to actually exercise the real dafsa machinery.
LGTM, I trust you to follow up appropriately on outstanding issues https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... File components/url_formatter/url_formatter_unittest.cc (right): https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are listed for unittest in the top domain list. On 2017/05/15 18:27:57, ncarter wrote: > On 2017/05/14 09:36:23, jungshik at Google wrote: > > On 2017/05/10 22:38:47, Peter Kasting wrote: > > > I still would like to see the test poke these into place rather than have > them > > > included in the compiled-in list. > > > > I don't disagree with you, but couldn't think of a simple way to add them only > > for test at run-time. > > > > Perhaps, I can do something similar to what's done in > > net/base/registry_controlled_domains/registry_controlled_domain.h and related > > files. > > The way the other unittests work, we do a separate DAFSA generation step on the > test data, and there's a SetDafsaForTesting() function (SetFindDomainGraph) that > lets you swap out the global DAFSA structure. SGTM https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... File components/url_formatter/idn_spoof_checker.h (right): https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:57: // Returns true if all the Cyrillic letters in |label| belong to a set of Nit: I would put a blank line above this comment and the next one, just to make "comment + function" easier for the eye to parse visually. https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:61: // successfully and stored in |skeleton|. Nit: This comment sounds like the function just checks the skeleton rather than computing it? Maybe "Stores the confusability skeleton for |hostname| in |skeleton|. Returns whether the computation was successful."? All that said, I don't actually see this method defined anywhere in the codebase. Maybe it doesn't really exist?
On 2017/05/15 18:27:57, ncarter (slow) wrote: > https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... > File components/url_formatter/url_formatter_unittest.cc (right): > > https://codereview.chromium.org/2784933002/diff/750001/components/url_formatt... > components/url_formatter/url_formatter_unittest.cc:286: // 'digklmo68.co.uk" are > listed for unittest in the top domain list. > On 2017/05/14 09:36:23, jungshik at Google wrote: > > On 2017/05/10 22:38:47, Peter Kasting wrote: > > > I still would like to see the test poke these into place rather than have > them > > > included in the compiled-in list. > > > > I don't disagree with you, but couldn't think of a simple way to add them only > > for test at run-time. > > > > Perhaps, I can do something similar to what's done in > > net/base/registry_controlled_domains/registry_controlled_domain.h and related > > files. > > The way the other unittests work, we do a separate DAFSA generation step on the > test data, and there's a SetDafsaForTesting() function (SetFindDomainGraph) that > lets you swap out the global DAFSA structure. > > Alternatively, you could add a layer of indirection that would mock out the > DAFSA lookup altogether, but IMO the first approach is better: it's worthwhile > for tests to actually exercise the real dafsa machinery. Yup, that's what I meant by 'doing something similar to what's done in net/base/rcd/rcd.h' (I saw what's done there :-)). And, I agree with you that the first approach is better.
Thank you, Peter and Nick. > LGTM, I trust you to follow up appropriately on outstanding issues Thank you for trusting me :-) Would you mind if land this CL as it is now (with nits taken care of ) and make a follow-up CL to address those issues? I already have a CL that does #1 and partly #2, but have been swamped with other issues. Because a few IDN-related bugs would be resolved with this CL, I like to get this one landed and work on the follow-up. 1) use a 1-step instead of 2-step process to generate *gper file; this can be done using one of methods in r-c-d 2) have a separate gperf file for unit test 3) avoid hard-coding the max number of labels to test for match; r-c-d can help this too except that I need a method to r-c-d that would treat an input domain names as canonicalized (although it's not because it has IDN) without checking. This would be used to get eTLD+1 (where eTLD is all ASCII). There's already one in r-c-d but in a debug build it'd barf that the input is not canonicalized when I pass an IDN. https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... File components/url_formatter/idn_spoof_checker.h (right): https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:57: // Returns true if all the Cyrillic letters in |label| belong to a set of On 2017/05/15 18:55:31, Peter Kasting wrote: > Nit: I would put a blank line above this comment and the next one, just to make > "comment + function" easier for the eye to parse visually. Done. https://codereview.chromium.org/2784933002/diff/830001/components/url_formatt... components/url_formatter/idn_spoof_checker.h:61: // successfully and stored in |skeleton|. On 2017/05/15 18:55:31, Peter Kasting wrote: > Nit: This comment sounds like the function just checks the skeleton rather than > computing it? Maybe "Stores the confusability skeleton for |hostname| in > |skeleton|. Returns whether the computation was successful."? > > All that said, I don't actually see this method defined anywhere in the > codebase. Maybe it doesn't really exist? ooops. Yeah, it's gone a few PS's ago. Thank you for the catch.
Following up in another CL is fine with me.
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* ==========
The CQ bit was checked by jshin@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/2784933002/#ps850001 (title: "Peter's latest comments addressed")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 850001, "attempt_start_ts": 1495171778009680, "parent_rev": "375fc7a1d4e279c675ce23f239604f8aa80fef53", "commit_rev": "a8add0308ba6067eb3de5a8fe82f9c2f2460ad91"}
Message was sent while issue was closed.
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Message was sent while issue was closed.
Committed patchset #47 (id:850001) as https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f...
Message was sent while issue was closed.
A revert of this CL (patchset #47 id:850001) has been created in https://codereview.chromium.org/2889303003/ by tsergeant@chromium.org. The reason for reverting is: This CL is causing compile to fail on Win x64: https://build.chromium.org/p/chromium/builders/Win%20x64/builds/11432 FAILED: obj/components/url_formatter/top_domains/make_top_domain_gperf/make_top_domain_gperf.obj make_top_domain_gperf.cc(46): error C2220: warning treated as error - no 'object' file generated make_top_domain_gperf.cc(46): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data .
Message was sent while issue was closed.
Findit (https://goo.gl/kROfz5) confirmed this CL at revision 473109 as the culprit for failures in the build cycles as shown on: https://findit-for-me.appspot.com/waterfall/culprit?key=ag9zfmZpbmRpdC1mb3Itb...
Message was sent while issue was closed.
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win_clang_x64_rel,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win_clang_x64_rel,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Relanding after revert with a compile fix for Win x64. Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Description was changed from ========== Add checks against spoofing attempt at top domains Relanding after revert with a compile fix for Win x64. Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Relanding after revert with a compile fix for Win x64. Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Message was sent while issue was closed.
Description was changed from ========== Add checks against spoofing attempt at top domains Relanding after revert with a compile fix for Win x64. Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ========== to ========== Add checks against spoofing attempt at top domains Remove diacritic marks from a hostname and calculate the confusability skeleton of the accent-free name. Look it up in the pre-calculated list of the skeletons of top 10k domains. Removing diacritic marks from a hostname is equivalent to comparing names with the primary collation strength in the root locale. To make them equivalent, three mappings are added (ł > l; ø > o; đ > d) on top of the diacritic-removal. Also add two more mappings ([кĸκ] > k, п > n) to supplement the Unicode's confusables list. Binary file size increase: ~ 59kB for the DAFSA representation of top domain name skeletons. The IDN display policy check takes ~ 2µs longer on the average (3.3 µs => 5.5µs) on my machine per the test run over ~1 million IDNs in com TLD). It adds about 1500 domains to the list of domains to display in Punycode out of ~ 1 million IDNs in com TLD. (3018 => 4571) In addition, disallow combining diarctic marks unless they're preceded by Latin-Greek-Cyrillic. BUG=703750,714628,719199,722639 TEST=components_unittests --gtest_filter=*IDNToUni* CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.win:win_chromium_x64_rel_ng,win10_chromium_x64_rel_ng Review-Url: https://codereview.chromium.org/2784933002 Cr-Commit-Position: refs/heads/master@{#473109} Committed: https://chromium.googlesource.com/chromium/src/+/a8add0308ba6067eb3de5a8fe82f... ==========
Message was sent while issue was closed.
Instead of reopening the committed CL, decided to make a new CL with 'size_t <-> int" conversion issue addressed with checked_cast<> for win_x64. https://codereview.chromium.org/2897873002 |