| 
 | 
 | 
 Chromium Code Reviews
 Chromium Code Reviews Issue 
            2683793010:
    Block domain labels made of Cyrillic letters that look alike Latin  (Closed)
    
  
    Issue 
            2683793010:
    Block domain labels made of Cyrillic letters that look alike Latin  (Closed) 
  | Created: 3 years, 10 months ago by jungshik at Google Modified: 3 years, 9 months ago CC: markdavis, sffc, emilyschechter Target Ref: refs/heads/master Project: chromium Visibility: Public. | DescriptionBlock domain labels made of Cyrillic letters that look alike Latin
Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф.
BUG=683314
TEST=components_unittests --gtest_filter=U*IDN*
Review-Url: https://codereview.chromium.org/2683793010
Cr-Commit-Position: refs/heads/master@{#459226}
Committed: https://chromium.googlesource.com/chromium/src/+/08cb718ba7c3961c1006176c9faba0a5841ec792
   Patch Set 1 #Patch Set 2 : check Cyrl-Latn alikes only for non-IDN tlds #
      Total comments: 11
      
     Patch Set 3 : exclude non-Letter when checking against Cyrl-Latn look-alikes #Patch Set 4 : Streamline isTLD ASCII check #Patch Set 5 : Add ӕъЬвҽвԍнкпгмтцѵѡү to the list of Cyr-Lat alike set #Patch Set 6 : Do not include "ӕ вҽвԍн мтц" in the Cyr-Lat alike list #Patch Set 7 : Do not include "ҽк ѡү" in Cyr-Lat alike list #Patch Set 8 : Add back ҽ and ѡ to the Cyr-Lat alike-set #Patch Set 9 : a bit more tweaking; no change in the blocked set #Patch Set 10 : add a test with U+00B7 and Latin-alike Cyrillics #
      Total comments: 8
      
     Patch Set 11 : address review comments by Peter #Patch Set 12 : remove unncessary check #Patch Set 13 : go back to ps11 #
 Messages
    Total messages: 25 (9 generated)
     
 Description was changed from ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters BUG= ========== to ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters BUG=683314 TEST=components_unittests --gtest_filter=U*IDN* ========== 
 Description was changed from ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters BUG=683314 TEST=components_unittests --gtest_filter=U*IDN* ========== to ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф. BUG=683314 TEST=components_unittests --gtest_filter=U*IDN* ========== 
 
 https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); "м, т, н, в" look like smallcap Latin and it's debatable whether or not to include them in the set. 
 sffc@google.com changed reviewers: + sffc@google.com 
 https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:209: is_tld_ascii = false; I don't really understand what this part of the code is doing. https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); On 2017/02/15 18:50:21, jungshik at Google wrote: > "м, т, н, в" look like smallcap Latin and it's debatable whether or not to > include them in the set. I'd look at the capital letters too. Here's a possible list of strong matches based on manual inspection of confusables.txt, which includes more characters than your list: асԁеһіјӏорԛѕԝхуАВСЕԌНІӀЈКМОРЅТԜХҮ And here's a possibly incomplete list of weaker matches; up to you whether you want to include them: ӕъЬвҽвԍнкпгмтцѵѡүӔѴҲ https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:376: if (is_tld_ascii && cyrillic_letters_latin_alike_.containsAll(label_string)) I think you should compare only the letter characters. For example, a spoof string "рох-рох" that contains a non-letter character should still be considered a spoof attack. 
 lgarron@chromium.org changed reviewers: + lgarron@chromium.org 
 https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); Also consider ԁ, Ӏ, and maybe ѵ, listed at the bottom of [1]. [1] https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); Also consider ԁ, Ӏ, and maybe ѵ, listed at the bottom of [1]. [1] https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic 
 https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); On 2017/02/15 at 20:24:49, lgarron wrote: > Also consider ԁ, Ӏ, and maybe ѵ, listed at the bottom of [1]. > > > [1] https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic (Note that Ӏ is a spoof of l, not just I.) 
 https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:209: is_tld_ascii = false; On 2017/02/15 19:57:30, sffc wrote: > I don't really understand what this part of the code is doing. It's checking if the TLD starts with 'xn--'. Otherwise, it assumes that it's a non-IDN TLD. Note that |host| is in ACE at this point. https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); On 2017/02/15 19:57:30, sffc wrote: > On 2017/02/15 18:50:21, jungshik at Google wrote: > > "м, т, н, в" look like smallcap Latin and it's debatable whether or not to > > include them in the set. > > I'd look at the capital letters too. Here's a possible list of strong matches > based on manual inspection of confusables.txt, which includes more characters > than your list: > > асԁеһіјӏорԛѕԝхуАВСЕԌНІӀЈКМОРЅТԜХҮ Well, uppercase letters will not 'survive' (they'll all case-mapped to lowercase letters before being displayed). And, one cannot register domains with uppercase letters with most (if not all) registrars. For instance, you cannot in Verisign-controlled TLDs. > And here's a possibly incomplete list of weaker matches; up to you whether you > want to include them: > > ӕъЬвҽвԍнкпгмтцѵѡүӔѴҲ Thank you for the list. Expanding the set has a risk of having too many false positives. The list is currently in flux and I'll try various sets and see how they work. https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:331: icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); On 2017/02/15 20:25:56, lgarron wrote: > On 2017/02/15 at 20:24:49, lgarron wrote: > > Also consider ԁ, Ӏ, and maybe ѵ, listed at the bottom of [1]. Thanks. U+0501 and U+0475 I'll consider. As for U+04C0, see below. > > > > > > [1] https://en.wikipedia.org/wiki/IDN_homograph_attack#Cyrillic > > (Note that Ӏ is a spoof of l, not just I.) U+04C0 ( 'Ӏ') is uppercase so that it will be normalized away to lowercase (U+04CF that is already included) before being displayed. https://codereview.chromium.org/2683793010/diff/20001/components/url_formatte... components/url_formatter/url_formatter.cc:376: if (is_tld_ascii && cyrillic_letters_latin_alike_.containsAll(label_string)) On 2017/02/15 19:57:30, sffc wrote: > I think you should compare only the letter characters. For example, a spoof > string "рох-рох" that contains a non-letter character should still be considered > a spoof attack. That's a good point. Thanks ! 
 *ping* (Motivated by a bug that uses Turkish, although it's not a perfect spoof like Cyrillic: https://bugs.chromium.org/p/chromium/issues/detail?id=703750) 
 Something is not right here. My reply to comments are not included. 
 On 2017/03/21 21:51:59, jungshik at Google wrote: > Something is not right here. My reply to comments are not included. Ignore the above comment. PS 1 with Cyrillic-letters that look like Latin-small-cap rejects 10,486 domains in рф (out of 865k; > 1%). PS 1 without Cyrillic-letters that look like Latin-small-cap rejects 800 domains in рф. Limiting the Cyr-Latin look-alike check to ASCII TLDs (e.g. .com, .org, .net, etc) would make it more aggressive. Let me get back here with stats for .com/.net/.org. 
 jshin@chromium.org changed reviewers: + pkasting@chromium.org 
 Can you take a look? I'll not try to fine-tune a Latin-alike Cyrillic set to use here. This change might be short-lived given what I'm planning to do for bug 703750. (I didn't look for "look-alikes" of top (e.g. Alexa) N domains ). With PS 5 (that includes small-cap-Latin-alike Cyrillic), roughly 2,800 additional domains are displayed in punycode out of 1,042,537 IDN domains in .com. Without them (PS 4, 6-9), additional 500 ~ 600 domains are displayed in punycode. Perf: An added test increases the checking time by 1/6 (from 3us / domain => 3.5us / domain on my pretty beefy machine). (there's one potential perf improvement that I didn't measure yet). 
 LGTM https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... File components/url_formatter/url_formatter.cc (right): https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:253: // Returns true if |label| is safe to display as Unicode. When Nit: When -> When the Wrap at 80 columns. https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:330: // These Cyrillic letters look like Latin. A domain label entirely Nit: Blank line above this https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:331: // made of these letters are blocked as a poorman's whole-script-spoofable. Nit: are -> is; poorman's -> simplified https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:335: // icu::UnicodeString("[аеорсухьѕіјһмӏтнв]"), status); Looks like these comments are leftovers? https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:373: // extra check unless it contains Kana letter exceptions or it's made enitrely Nit: entirely https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:374: // of Cyrillic letters that look alike Latin letters. Note that Nit: alike -> like https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:445: // would work in most cases, but it'd not if a label has non-letters outside Nit: Remove "it'd" https://codereview.chromium.org/2683793010/diff/180001/components/url_formatt... components/url_formatter/url_formatter.cc:449: UChar32 c; Nit: Define inside loop (and can then be const) 
 Thanks, Peter for the review. What's the best practice of landing a 'private' CL? I'm tempted to open it up and run through trybots (even though I ran the affected test locally) because once landed, this change will be visible to the public anyway. 
 On 2017/03/23 at 20:10:17, jshin wrote: > Thanks, Peter for the review. > > What's the best practice of landing a 'private' CL? > > I'm tempted to open it up and run through trybots (even though I ran the affected test locally) because once landed, this change will be visible to the public anyway. All security CLs are public when they land at the latest; there's no need to keep this secret (especially since IDN spoofing is a known problem in general, and isn't a dangerous new vulnerability). 
 The CQ bit was checked by jshin@chromium.org 
 The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/2683793010/#ps240001 (title: "go back to ps11") 
 CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or... 
 On 2017/03/23 20:12:12, lgarron wrote: > On 2017/03/23 at 20:10:17, jshin wrote: > > Thanks, Peter for the review. > > > > What's the best practice of landing a 'private' CL? > > > > I'm tempted to open it up and run through trybots (even though I ran the > affected test locally) because once landed, this change will be visible to the > public anyway. > > All security CLs are public when they land at the latest; there's no need to > keep this secret (especially since IDN spoofing is a known problem in general, > and isn't a dangerous new vulnerability). Ok. Thank you for the answer. I lifted the protection and added this to CQ. 
 CQ is committing da patch.
Bot data: {"patchset_id": 240001, "attempt_start_ts": 1490300001173350,
"parent_rev": "1901afa5e03bad9ce8a72691f746a685038241b0", "commit_rev":
"08cb718ba7c3961c1006176c9faba0a5841ec792"}
 
            
              
                Message was sent while issue was closed.
              
            
             Description was changed from ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф. BUG=683314 TEST=components_unittests --gtest_filter=U*IDN* ========== to ========== Block domain labels made of Cyrillic letters that look alike Latin Block a label made entirely of Latin-look-alike Cyrillic letters when the TLD is not an IDN (i.e. this check is ON only for TLDs like 'com', 'net', 'uk', but not applied for IDN TLDs like рф. BUG=683314 TEST=components_unittests --gtest_filter=U*IDN* Review-Url: https://codereview.chromium.org/2683793010 Cr-Commit-Position: refs/heads/master@{#459226} Committed: https://chromium.googlesource.com/chromium/src/+/08cb718ba7c3961c1006176c9fab... ========== 
 
            
              
                Message was sent while issue was closed.
              
            
             Committed patchset #13 (id:240001) as https://chromium.googlesource.com/chromium/src/+/08cb718ba7c3961c1006176c9fab... | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
