|
|
Created:
5 years, 5 months ago by alshabalin Modified:
5 years, 5 months ago CC:
chromium-apps-reviews_chromium.org, chromium-reviews, extensions-reviews_chromium.org, felt Base URL:
https://chromium.googlesource.com/chromium/src.git@lkgr Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionUnpunycode search keywords and short names.
When automatically generating a new search engine, convert keywords
and short names from punycode to unicode.
BUG=510832
Committed: https://crrev.com/e6212ac6b82fe6e2d73cd1e112b016fa45343927
Cr-Commit-Position: refs/heads/master@{#340053}
Patch Set 1 #Patch Set 2 : Rebased to origin/master. #
Total comments: 18
Patch Set 3 : Remove UTF-8 from source code, use std::string(). #
Total comments: 15
Patch Set 4 : Restrict IDN-decoding to keywords generated from URL; Use prefs::kAcceptLanguages for IDN-decoding #
Total comments: 13
Patch Set 5 : Fix comments, naming, code style. #
Total comments: 2
Patch Set 6 : Fix grammar. #Messages
Total messages: 41 (8 generated)
alshabalin@yandex-team.ru changed reviewers: + jochen@chromium.org, stevenjb@chromium.org, stevet@chromium.org
LGTM for search engines - provided that there is another reviewer who understands the Punycode conversion methods better than I do.
jochen@chromium.org changed reviewers: + felt@chromium.org, pkasting@chromium.org
+pkasting because omnibox affected +felt because this might be security ux relevant
Is there an associated BUG= here? https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. Why do we need to do this for these particular cases, but we don't go so far as to, say, run IDNToUnicode() unconditionally inside SetShortName() and SetKeyword()? https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:125: data.SetShortName(net::IDNToUnicode(parameters->name, "")); Nit: Use std::string() in place of "" everywhere in this change. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); I think we should pass the user's accept-languages here, so we'll un-punycode more cases that the user should understand. This should probably be passed to this function as a second arg, and so on up the call chain until you get to a chrome-side caller that can compute this as: profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) Another route in some cases would be to add an embedder client method to get these languages, then have the components-side code that needs to call the client method get the embedder to instantiate such a client instance for it. Yet another route would be to add this information, or accessors that compute it, to SearchTermsData. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url_parser.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url_parser.cc:257: base::IsStringASCII(context->string_) Why do an ASCII check here? Why not just unconditionally go the IDNToUnicode() route as you have elsewhere? If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII check and early-out itself? Or maybe we need to insert these sorts of conditionals in other places too? https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url_unittest.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url_unittest.cc:1708: ASSERT_EQ(base::UTF8ToUTF16("абв"), Don't put Unicode characters directly in source files; use \x escapes or similar.
On 2015/07/14 18:05:33, Peter Kasting wrote: > Is there an associated BUG= here? Couldn't find one. Should I create it?
https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/14 18:05:33, Peter Kasting wrote: > Why do we need to do this for these particular cases, but we don't go so far as > to, say, run IDNToUnicode() unconditionally inside SetShortName() and > SetKeyword()? Because I want to perform IDN conversion on user input only. For JS API here and for site-supplied OpenSearch data (in TemplateURLParser) I defensively run this conversion to catch possible mistakes from the site maintainer: e.g. when site returns OpenSearch xml with ShortName in punycode (probably auto-generated from the hostname), I want the end user to see it in the decoded form. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); On 2015/07/14 18:05:33, Peter Kasting wrote: > I think we should pass the user's accept-languages here, so we'll un-punycode > more cases that the user should understand. > > This should probably be passed to this function as a second arg, and so on up > the call chain until you get to a chrome-side caller that can compute this as: > > profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) > > Another route in some cases would be to add an embedder client method to get > these languages, then have the components-side code that needs to call the > client method get the embedder to instantiate such a client instance for it. > Yet another route would be to add this information, or accessors that compute > it, to SearchTermsData. My thought was that a string in a language user does not read and a string in IDN form are equally incomprehensible but the first one looks nicer. After all, punycode is there only because hostname must be ASCII and as such is an implementation detail. Or is there some guideline that we should unpunycode only to the languages the user understands? https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url_parser.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url_parser.cc:257: base::IsStringASCII(context->string_) On 2015/07/14 18:05:33, Peter Kasting wrote: > Why do an ASCII check here? Why not just unconditionally go the IDNToUnicode() > route as you have elsewhere? > > If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII check > and early-out itself? Or maybe we need to insert these sorts of conditionals in > other places too? IDNToUnicode requires ASCII because it calls ASCIIToUTF16 on its argument. Here it's the data from OpenSearch xml which could contain unicode. In other places it is guaranteed to be in ASCII: in JS API it previously called ASCIIToUTF16, in TemplateURL::GenerateKeyword it uses hostname. But it might indeed make sense to move the ASCII check to IDNToUnicode.
https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:125: data.SetShortName(net::IDNToUnicode(parameters->name, "")); On 2015/07/14 18:05:33, Peter Kasting wrote: > Nit: Use std::string() in place of "" everywhere in this change. Done. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url_unittest.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url_unittest.cc:1708: ASSERT_EQ(base::UTF8ToUTF16("абв"), On 2015/07/14 18:05:33, Peter Kasting wrote: > Don't put Unicode characters directly in source files; use \x escapes or > similar. Done.
On 2015/07/15 14:17:35, alshabalin wrote: > On 2015/07/14 18:05:33, Peter Kasting wrote: > > Is there an associated BUG= here? > > Couldn't find one. Should I create it? Yes please.
https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/15 at 14:18:01, alshabalin wrote: > On 2015/07/14 18:05:33, Peter Kasting wrote: > > Why do we need to do this for these particular cases, but we don't go so far as > > to, say, run IDNToUnicode() unconditionally inside SetShortName() and > > SetKeyword()? > > Because I want to perform IDN conversion on user input only. For JS API here and > for site-supplied OpenSearch data (in TemplateURLParser) I defensively run this > conversion to catch possible mistakes from the site maintainer: e.g. when site > returns OpenSearch xml with ShortName in punycode (probably auto-generated from > the hostname), I want the end user to see it in the decoded form. I don't understand your paragraph. It seems to first say you want to only un-punycode user input, then shortly after that you want to un-punycode non-user-input (e.g. stuff in the OSDD). https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); On 2015/07/15 at 14:18:01, alshabalin wrote: > On 2015/07/14 18:05:33, Peter Kasting wrote: > > I think we should pass the user's accept-languages here, so we'll un-punycode > > more cases that the user should understand. > > > > This should probably be passed to this function as a second arg, and so on up > > the call chain until you get to a chrome-side caller that can compute this as: > > > > profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) > > > > Another route in some cases would be to add an embedder client method to get > > these languages, then have the components-side code that needs to call the > > client method get the embedder to instantiate such a client instance for it. > > Yet another route would be to add this information, or accessors that compute > > it, to SearchTermsData. > > My thought was that a string in a language user does not read and a string in > IDN form are equally incomprehensible but the first one looks nicer. After all, > punycode is there only because hostname must be ASCII and as such is an > implementation detail. > > Or is there some guideline that we should unpunycode only to the languages > the user understands? It seems like you're misunderstanding how the conversion works. If you pass no languages, we will fail to render as Unicode any hostnames that mix characters from different scripts. If you pass the languages the user understands, then we'll be willing to show as Unicode hostnames which use characters from those scripts. Passing the empty string doesn't mean "un-punycode everything", it means something closer to the opposite. In other words, I'm trying to get your code to un-punycode more hostnames, not fewer. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url_parser.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url_parser.cc:257: base::IsStringASCII(context->string_) On 2015/07/15 at 14:18:01, alshabalin wrote: > On 2015/07/14 18:05:33, Peter Kasting wrote: > > Why do an ASCII check here? Why not just unconditionally go the IDNToUnicode() > > route as you have elsewhere? > > > > If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII check > > and early-out itself? Or maybe we need to insert these sorts of conditionals in > > other places too? > > IDNToUnicode requires ASCII because it calls ASCIIToUTF16 on its argument. Here > it's the data from OpenSearch xml which could contain unicode. In other places > it is guaranteed to be in ASCII: in JS API it previously called ASCIIToUTF16, in > TemplateURL::GenerateKeyword it uses hostname. > > But it might indeed make sense to move the ASCII check to IDNToUnicode. At the least you need a comment in all these places noting why you do or do not need an ASCII check.
https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/15 21:19:12, Peter Kasting wrote: > On 2015/07/15 at 14:18:01, alshabalin wrote: > > On 2015/07/14 18:05:33, Peter Kasting wrote: > > > Why do we need to do this for these particular cases, but we don't go so far > as > > > to, say, run IDNToUnicode() unconditionally inside SetShortName() and > > > SetKeyword()? > > > > Because I want to perform IDN conversion on user input only. For JS API here > and > > for site-supplied OpenSearch data (in TemplateURLParser) I defensively run > this > > conversion to catch possible mistakes from the site maintainer: e.g. when site > > returns OpenSearch xml with ShortName in punycode (probably auto-generated > from > > the hostname), I want the end user to see it in the decoded form. > > I don't understand your paragraph. It seems to first say you want to only > un-punycode user input, then shortly after that you want to un-punycode > non-user-input (e.g. stuff in the OSDD). I was thinking server-side user at the time and actually meant input from OSDD. It is confusing, sorry. Anyway, my point is to leave the decision whether to un-punycode or not to the caller because un-punycoding ShortName is not a normal behaviour but a workaround to fix accidental mistakes for the end-user benefit. I can't come up with an example when doing the conversion would hurt though. https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); On 2015/07/15 21:19:12, Peter Kasting wrote: > On 2015/07/15 at 14:18:01, alshabalin wrote: > > On 2015/07/14 18:05:33, Peter Kasting wrote: > > > I think we should pass the user's accept-languages here, so we'll > un-punycode > > > more cases that the user should understand. > > > > > > This should probably be passed to this function as a second arg, and so on > up > > > the call chain until you get to a chrome-side caller that can compute this > as: > > > > > > profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) > > > > > > Another route in some cases would be to add an embedder client method to get > > > these languages, then have the components-side code that needs to call the > > > client method get the embedder to instantiate such a client instance for it. > > > > Yet another route would be to add this information, or accessors that > compute > > > it, to SearchTermsData. > > > > My thought was that a string in a language user does not read and a string in > > IDN form are equally incomprehensible but the first one looks nicer. After > all, > > punycode is there only because hostname must be ASCII and as such is an > > implementation detail. > > > > Or is there some guideline that we should unpunycode only to the languages > > the user understands? > > It seems like you're misunderstanding how the conversion works. If you pass no > languages, we will fail to render as Unicode any hostnames that mix characters > from different scripts. If you pass the languages the user understands, then > we'll be willing to show as Unicode hostnames which use characters from those > scripts. Passing the empty string doesn't mean "un-punycode everything", it > means something closer to the opposite. > > In other words, I'm trying to get your code to un-punycode more hostnames, not > fewer. I thought script-mixing in hostnames is not allowed anyway because it opens up more ways to spoof. But apparently I'm wrong https://tools.ietf.org/html/rfc5890#section-4.4 My mistake, sorry, will rewrite to pass accepted languages then.
On 2015/07/15 16:16:41, felt wrote: > On 2015/07/15 14:17:35, alshabalin wrote: > > On 2015/07/14 18:05:33, Peter Kasting wrote: > > > Is there an associated BUG= here? > > > > Couldn't find one. Should I create it? > > Yes please. Done. https://code.google.com/p/chromium/issues/detail?id=510832
felt@chromium.org changed reviewers: + mgiuca@chromium.org - felt@chromium.org
felt@chromium.org changed reviewers: + felt@chromium.org
mgiuca, could you take a look at this please?
https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/16 at 15:13:07, alshabalin wrote: > On 2015/07/15 21:19:12, Peter Kasting wrote: > > On 2015/07/15 at 14:18:01, alshabalin wrote: > > > On 2015/07/14 18:05:33, Peter Kasting wrote: > > > > Why do we need to do this for these particular cases, but we don't go so far > > as > > > > to, say, run IDNToUnicode() unconditionally inside SetShortName() and > > > > SetKeyword()? > > > > > > Because I want to perform IDN conversion on user input only. For JS API here > > and > > > for site-supplied OpenSearch data (in TemplateURLParser) I defensively run > > this > > > conversion to catch possible mistakes from the site maintainer: e.g. when site > > > returns OpenSearch xml with ShortName in punycode (probably auto-generated > > from > > > the hostname), I want the end user to see it in the decoded form. > > > > I don't understand your paragraph. It seems to first say you want to only > > un-punycode user input, then shortly after that you want to un-punycode > > non-user-input (e.g. stuff in the OSDD). > > I was thinking server-side user at the time and actually meant input from OSDD. > It is confusing, sorry. Anyway, my point is to leave the decision whether to > un-punycode or not to the caller because un-punycoding ShortName is not > a normal behaviour but a workaround to fix accidental mistakes for the end-user > benefit. I can't come up with an example when doing the conversion would hurt > though. Well, it's not normal in the sense that we don't currently do it, but I'm not convinced it's ever wrong. I'm inclined to suggest centralizing the conversion so it just always happens. However, that might make getting the correct languages tricky, so it may be that the current solution is easier to implement.
Hi, thanks for including me, felt@. I have a problem with *some* of the decodings in this CL, as indicated in my comments. This CL attempts to change three conversions: 1. In the searchEnginePrivate.addOtherSearchEngine private API function, IDN-decode the "name" and "keyword" arguments before storing them in the search engines database. (I believe this API is used by the Settings page when the user adds or alters a search engine.) 2. When a search engine is automatically generated from a domain name, IDN-decode the domain name before storing it as the short name / keyword in the search engines database. 3. When parsing an OpenSearch description document (XML), IDN-decode the ShortName field before storing it in the search engines database. As I've noted in the comments, I believe #2 is a good change, but #1 and #3 are invalid. If you are converting a domain name into a human-readable string (as in #2), then IDN-decoding is a necessary step. But #1 and #3 are not processing domain names, they are dealing with ordinary strings specifically intended for use as a search engine short name / keyword, so why treat them as a domain name? It looks like the case in the bug report (http://crbug.com/510832) of "http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem comes from auto-generating the keyword from a domain name. That will be fixed by #2. Any site that has an OpenSearch XML file should not need fixing because the ShortName field should not be IDN-encoded (and if it is, it is a problem with the site, not something we should be fixing on the user agent side). Please see my detailed comments. https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. It sounds like you're saying that because site maintainers might make the mistake of putting Punycode in the XML ShortName field, we should pro-actively decode it for them...? That sounds like a Bad Thing. The OpenSearch spec [1] does not say that this should be Punycode-encoded, or that it is a domain name. It simply says that it's "a brief human-readable title". You could make a similar argument for decoding percent-escaped text in this field, or performing any other type of unescaping, "just in case" some badly formatted data was accidentally encoded. But that's not how user agents should deal with data. (Conversely: do you have any evidence of high-profile sites that have a Punycoded domain in their ShortName field?) [1] http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22ShortName.22_e... https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:127: data.SetShortName(net::IDNToUnicode(parameters->name, std::string())); This should not be unpunycoded. The name and keyword parameters are not domains and should not be expected to be Punycode-encoded (they are being passed in through an API). If the rationale for this change is that users might type in an IDN-encoded domain into the "keyword" or "short name" field, and that will get passed through this API, and we want to decode those, I would say: a) that is not our concern. The user typed/pasted a string into a text field, and we should respect that string. There is no reason to decode the string in this text field any more than if they typed/pasted it into any other text field. b) if we did want to decode that user input, it should be done at the user interface level, not at the API level. You should do the decoding in the settings page UI, not here. https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:162: turl, base::UTF8ToUTF16(parameters->name), If you're going to change addOtherSearchEngine, you also need to change this behaviour (updateSearchEngine). (But I'm telling you to revert the above change, so you shouldn't have to do anything here.) https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url.cc:1220: // We do not want any punycode in any language. This comment should be more detailed and should refer to IDN, not Punycode. How about: "The URL host may be an internationalized domain name (IDN). If so, decode it into a regular Unicode string." You should not be using the term "Punycode" here. Punycode is a sub-step in the IDN encoding/decoding algorithm. Roughly, the IDN encoding algorithm is this: 1. Split the domain name by '.' character. 2. For any segment that contains non-ASCII characters, Punycode-encode it and prepend "xn--". 3. Join the segments by '.' characters. So what you are doing here is not Punycode-decoding, but IDN-decoding. Also I'd like to see it separated from the above comment about stripping "www". Maybe break this out like so: base::string16 keyword = url.host(); // <comment about decoding IDN> keyword = net::IDNToUnicode(keyword, std::string()); // <comment about stripping "www"> keyword = net::StripWWW(keyword); https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url_fetcher_unittest.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:335: std::string osdd_file_name("punycode_open_search.xml"); This test is not necessary since we do not expect ShortNames to be IDN-encoded. (And if they are, we should not do anything special in that case.) https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:344: TEST_F(TemplateURLFetcherTest, UnicodeTest) { This is a good test! I'm kind of surprised that none of the other tests in this file actually check *any* fields of the TemplateURL object they get back from GetTemplateURLForKeyword. If you're interested, you could go to the above tests and add some assertions to that effect. (i.e., this is the very first assertion in this file for t-url->short_name().) https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:356: EXPECT_EQ(keyword, t_url->short_name()); It's not clear why you expect short_name to equal keyword (other than the coincidence that the keyword and short name are the same string). For a more robust test, can you change the keyword to be a different string to the short name, and introduce a new string "expected_short_name" for this test? https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url_parser.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_parser.cc:258: base::IsStringASCII(context->string_) This should not be unpunycoded. The ShortName field is not a domain and should not be expected to be Punycode-encoded. https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se... File components/test/data/search_engines/punycode_open_search.xml (right): https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se... components/test/data/search_engines/punycode_open_search.xml:1: <?xml version="1.0"?> Delete this file, as discussed above.
I'm probably not as convinced of the invalidity of aggressive un-punycoding as mgiuca is, but that also means I'm willing to be convinced he's right. Are there, indeed, some OSDDs in the real world with punycoded strings in these fields? If not, then sticking to the narrow case of only decoding when we autogenerate these values seems like the right move.
On 2015/07/17 04:13:11, Matt Giuca wrote: > Hi, thanks for including me, felt@. > > I have a problem with *some* of the decodings in this CL, as indicated in my > comments. This CL attempts to change three conversions: > > 1. In the searchEnginePrivate.addOtherSearchEngine private API function, > IDN-decode the "name" and "keyword" arguments before storing them in the search > engines database. (I believe this API is used by the Settings page when the user > adds or alters a search engine.) > 2. When a search engine is automatically generated from a domain name, > IDN-decode the domain name before storing it as the short name / keyword in the > search engines database. > 3. When parsing an OpenSearch description document (XML), IDN-decode the > ShortName field before storing it in the search engines database. > > As I've noted in the comments, I believe #2 is a good change, but #1 and #3 are > invalid. If you are converting a domain name into a human-readable string (as in > #2), then IDN-decoding is a necessary step. But #1 and #3 are not processing > domain names, they are dealing with ordinary strings specifically intended for > use as a search engine short name / keyword, so why treat them as a domain name? About #1: My initial thought was that this API was to become a way for sites to interactively add themselves as search engines. But I see I'm wrong http://crbug.com/480043 Ok, will remove. > It looks like the case in the bug report (http://crbug.com/510832) of > "http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem > comes from auto-generating the keyword from a domain name. That will be fixed by > #2. Any site that has an OpenSearch XML file should not need fixing because the > ShortName field should not be IDN-encoded (and if it is, it is a problem with > the site, not something we should be fixing on the user agent side). I have but one example: the other site mentioned in the bug report "http://новостиворонежа.рф" falls into #3 category. It delivers OpenSearch XML roughly via this url http://yandex.ru/sitesearch/opensearch.xml?ShortName=xn--80adamfj7adfbcazkp.x... Notice that it erroneously asks it to set ShortName in IDN-encoding instead of URL-encoding. This site however is not high profile which is probably why this has gone unnoticed. The decision to do IDN-conversion client-side anyway was made because it doesn't seem to complicate code much and doesn't hurt existing functionality while (slightly) improving end-user experience. If you still want it gone, I will remove it.
felt@chromium.org changed reviewers: - felt@chromium.org
On 2015/07/17 09:39:57, alshabalin wrote: > > About #1: My initial thought was that this API was to become a way for sites to > interactively add themselves as search engines. But I see I'm wrong > http://crbug.com/480043 Ok, will remove. Even if it were an API for sites to interactively add themselves, we would want them to be passing normal strings to that API, not IDN-encoded strings. So we wouldn't want this here either way. > > It looks like the case in the bug report (http://crbug.com/510832) of > > "http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem > > comes from auto-generating the keyword from a domain name. That will be fixed > by > > #2. Any site that has an OpenSearch XML file should not need fixing because > the > > ShortName field should not be IDN-encoded (and if it is, it is a problem with > > the site, not something we should be fixing on the user agent side). > > I have but one example: the other site mentioned in the bug report > "http://новостиворонежа.рф" falls into #3 category. It delivers OpenSearch XML > roughly via this url > http://yandex.ru/sitesearch/opensearch.xml?ShortName=xn--80adamfj7adfbcazkp.x... OK, I see that it does that. (It seems to do it client-side, because the "view source" version of the page doesn't have the link, but the DOM tree does.) > Notice that it erroneously asks it to set ShortName in IDN-encoding instead of > URL-encoding. > This site however is not high profile which is probably why this has gone > unnoticed. > The decision to do IDN-conversion client-side anyway was made because it doesn't > seem to complicate code much and doesn't hurt existing functionality while > (slightly) > improving end-user experience. We can't add new browser functionality (no matter how simple the code is) because one site (or a handful of sites) are doing the wrong thing. That fragments the web. It means we're giving those sites our blessing to continue doing the wrong thing, at a cost of compatibility with other browsers. Instead, those sites should change their behaviour to match the existing specification. For analogy: suppose you found a site (or three sites) that had an IDN-encoded <title> element, which was causing a user-visible unreadable ASCII string to appear in the browser's title bar, instead of the site's title. Should we change the parser to always attempt to IDN-decode all <title> elements on all HTML pages, to "fix" this problem? No, because this moves us away from the standard and from the behaviour of other browsers, and makes the problem worse (as we may encourage more sites to IDN-encode their titles). Instead, those sites should fix their broken HTML. The OpenSearch XML is the same issue, but less visible. On the other hand, if you can show that there is a widespread practice of sites IDN-encoding the ShortName field, then there is a case to make this change, but it should probably be done in consultation with the OpenSearch standard authors and other browser vendors (i.e. not in this CL). > If you still want it gone, I will remove it. Yes, please.
https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engin... components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); On 2015/07/15 21:19:12, Peter Kasting wrote: > On 2015/07/15 at 14:18:01, alshabalin wrote: > > On 2015/07/14 18:05:33, Peter Kasting wrote: > > > I think we should pass the user's accept-languages here, so we'll > un-punycode > > > more cases that the user should understand. > > > > > > This should probably be passed to this function as a second arg, and so on > up > > > the call chain until you get to a chrome-side caller that can compute this > as: > > > > > > profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) > > > > > > Another route in some cases would be to add an embedder client method to get > > > these languages, then have the components-side code that needs to call the > > > client method get the embedder to instantiate such a client instance for it. > > > > Yet another route would be to add this information, or accessors that > compute > > > it, to SearchTermsData. > > > > My thought was that a string in a language user does not read and a string in > > IDN form are equally incomprehensible but the first one looks nicer. After > all, > > punycode is there only because hostname must be ASCII and as such is an > > implementation detail. > > > > Or is there some guideline that we should unpunycode only to the languages > > the user understands? > > It seems like you're misunderstanding how the conversion works. If you pass no > languages, we will fail to render as Unicode any hostnames that mix characters > from different scripts. If you pass the languages the user understands, then > we'll be willing to show as Unicode hostnames which use characters from those > scripts. Passing the empty string doesn't mean "un-punycode everything", it > means something closer to the opposite. > > In other words, I'm trying to get your code to un-punycode more hostnames, not > fewer. Done. https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio... File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio... chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:127: data.SetShortName(net::IDNToUnicode(parameters->name, std::string())); On 2015/07/17 04:13:11, Matt Giuca wrote: > This should not be unpunycoded. The name and keyword parameters are not domains > and should not be expected to be Punycode-encoded (they are being passed in > through an API). > > If the rationale for this change is that users might type in an IDN-encoded > domain into the "keyword" or "short name" field, and that will get passed > through this API, and we want to decode those, I would say: > a) that is not our concern. The user typed/pasted a string into a text field, > and we should respect that string. There is no reason to decode the string in > this text field any more than if they typed/pasted it into any other text field. > b) if we did want to decode that user input, it should be done at the user > interface level, not at the API level. You should do the decoding in the > settings page UI, not here. Done. https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url.cc:1220: // We do not want any punycode in any language. On 2015/07/17 04:13:11, Matt Giuca wrote: > This comment should be more detailed and should refer to IDN, not Punycode. How > about: > "The URL host may be an internationalized domain name (IDN). If so, decode it > into a regular Unicode string." > > You should not be using the term "Punycode" here. Punycode is a sub-step in the > IDN encoding/decoding algorithm. Roughly, the IDN encoding algorithm is this: > 1. Split the domain name by '.' character. > 2. For any segment that contains non-ASCII characters, Punycode-encode it and > prepend "xn--". > 3. Join the segments by '.' characters. > > So what you are doing here is not Punycode-decoding, but IDN-decoding. > > Also I'd like to see it separated from the above comment about stripping "www". > Maybe break this out like so: > > base::string16 keyword = url.host(); > // <comment about decoding IDN> > keyword = net::IDNToUnicode(keyword, std::string()); > // <comment about stripping "www"> > keyword = net::StripWWW(keyword); Done. https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url_fetcher_unittest.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:335: std::string osdd_file_name("punycode_open_search.xml"); On 2015/07/17 04:13:11, Matt Giuca wrote: > This test is not necessary since we do not expect ShortNames to be IDN-encoded. > (And if they are, we should not do anything special in that case.) Done. https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:344: TEST_F(TemplateURLFetcherTest, UnicodeTest) { On 2015/07/17 04:13:11, Matt Giuca wrote: > This is a good test! I'm kind of surprised that none of the other tests in this > file actually check *any* fields of the TemplateURL object they get back from > GetTemplateURLForKeyword. If you're interested, you could go to the above tests > and add some assertions to that effect. > > (i.e., this is the very first assertion in this file for t-url->short_name().) Added a short_name() assertion to BasicAutodectedTest. https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_fetcher_unittest.cc:356: EXPECT_EQ(keyword, t_url->short_name()); On 2015/07/17 04:13:11, Matt Giuca wrote: > It's not clear why you expect short_name to equal keyword (other than the > coincidence that the keyword and short name are the same string). For a more > robust test, can you change the keyword to be a different string to the short > name, and introduce a new string "expected_short_name" for this test? Done. (Although I didn't introduce expected_short_name, making expectation base::UTF8ToUTF16 of a string literal) https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... File components/search_engines/template_url_parser.cc (right): https://codereview.chromium.org/1238683003/diff/40001/components/search_engin... components/search_engines/template_url_parser.cc:258: base::IsStringASCII(context->string_) On 2015/07/17 04:13:11, Matt Giuca wrote: > This should not be unpunycoded. The ShortName field is not a domain and should > not be expected to be Punycode-encoded. Done. https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se... File components/test/data/search_engines/punycode_open_search.xml (right): https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se... components/test/data/search_engines/punycode_open_search.xml:1: <?xml version="1.0"?> On 2015/07/17 04:13:11, Matt Giuca wrote: > Delete this file, as discussed above. Done.
Miguel, in fairness, are other browsers really a factor here? Chrome is more aggressive than they are (I thought) about using OSDDs for some interesting purpose. Is it easy to demonstrate Firefox and IE also showing punycoded names for the sample site that was posted here?
On 2015/07/20 at 16:47:17, Peter Kasting wrote: > Miguel, Er, Matt. I don't know why I always want to write "Miguel", I think it's just visually similar to your username. Sorry!
stevenjb@chromium.org changed reviewers: - stevenjb@chromium.org
https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... File chrome/browser/search_engines/ui_thread_search_terms_data.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... chrome/browser/search_engines/ui_thread_search_terms_data.cc:188: return profile_->GetPrefs()->GetString(prefs::kAcceptLanguages); Nit: Shorter: return profile_ ? profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) : std::string(); https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... File chrome/browser/search_engines/ui_thread_search_terms_data.h (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... chrome/browser/search_engines/ui_thread_search_terms_data.h:36: std::string GetAcceptedLanguages() const override; Call this GetAcceptLanguages(), not "accepted", because "accept-languages" is the meaningful phrase in question, and we use that terminology elsewhere. (I know grammatically this is weird.) https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc... File chrome/browser/ui/search_engines/search_engine_tab_helper.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc... chrome/browser/ui/search_engines/search_engine_tab_helper.cc:150: TemplateURLServiceFactory::GetForProfile(profile)) { Don't put a statement with side effects in a conditional. In this case, if you have the profile anyway, get the accept-languages from its prefs directly; don't go through the TemplateURLService. https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc... chrome/browser/ui/search_engines/search_engine_tab_helper.cc:202: url_service->search_terms_data().GetAcceptedLanguages())); Again, get the languages off the profile prefs directly here. https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... File components/search_engines/search_terms_data.h (right): https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/search_terms_data.h:78: // Proper implementation requires access to profile preferences Nit: For this second sentence, just copy the wording of the last sentence for NTPIsThemedParam() above. https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/template_url.cc:1218: // should be IDN-decoded using user's accepted languages. This rationale seems a bit inaccurate. How about adding this to the comment below (see the next two nits about how to reorder the code/comments here for this): "Since the underlying hostname may be punycoded, convert to Unicode if possible, so the keyword will look like the Unicode hostname the user is used to seeing." https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/template_url.cc:1219: base::string16 keyword = net::IDNToUnicode(url.host(), accept_languages); Nit: Combine this with the line below. https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/template_url.cc:1222: // Special case: if the host was exactly "www." (not sure this can happen but Nit: Move this "Special case" comment below the StripWWW() call so you can append the commentary about IDN-decoding here instead.
On 2015/07/20 16:47:17, Peter Kasting wrote: > Miguel, in fairness, are other browsers really a factor here? Chrome is more > aggressive than they are (I thought) about using OSDDs for some interesting > purpose. > > Is it easy to demonstrate Firefox and IE also showing punycoded names for the > sample site that was posted here? TBH I haven't tested this in other browsers, I just assumed OSDDs were supported across many browsers. Anyway, it looks like this code has been taken out now and I'm happy with that. (I still think, unless we see other browsers are actually IDN-decoding this field, it is not right to do it; that field is not specified as being IDN-encoded.) lgtm On 2015/07/20 16:48:13, Peter Kasting wrote: > Er, Matt. I don't know why I always want to write "Miguel", I think it's just > visually similar to your username. Sorry! I can see how that happened :)
https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... File chrome/browser/search_engines/ui_thread_search_terms_data.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... chrome/browser/search_engines/ui_thread_search_terms_data.cc:188: return profile_->GetPrefs()->GetString(prefs::kAcceptLanguages); On 2015/07/20 19:30:51, Peter Kasting wrote: > Nit: Shorter: > > return profile_ ? > profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) : std::string(); Done. https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... File chrome/browser/search_engines/ui_thread_search_terms_data.h (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e... chrome/browser/search_engines/ui_thread_search_terms_data.h:36: std::string GetAcceptedLanguages() const override; On 2015/07/20 19:30:51, Peter Kasting wrote: > Call this GetAcceptLanguages(), not "accepted", because "accept-languages" is > the meaningful phrase in question, and we use that terminology elsewhere. (I > know grammatically this is weird.) Done. https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc... File chrome/browser/ui/search_engines/search_engine_tab_helper.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc... chrome/browser/ui/search_engines/search_engine_tab_helper.cc:150: TemplateURLServiceFactory::GetForProfile(profile)) { On 2015/07/20 19:30:51, Peter Kasting wrote: > Don't put a statement with side effects in a conditional. > > In this case, if you have the profile anyway, get the accept-languages from its > prefs directly; don't go through the TemplateURLService. Done. Although I somewhat disagree. In my brain accept_languages for TemplateURL::GenerateKeyword is not an input bun an environment. And (again, in my mind) environment for TemplateURL should lie with TemplateURLService. https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... File components/search_engines/search_terms_data.h (right): https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/search_terms_data.h:78: // Proper implementation requires access to profile preferences On 2015/07/20 19:30:51, Peter Kasting wrote: > Nit: For this second sentence, just copy the wording of the last sentence for > NTPIsThemedParam() above. Done. https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/60001/components/search_engin... components/search_engines/template_url.cc:1218: // should be IDN-decoded using user's accepted languages. On 2015/07/20 19:30:52, Peter Kasting wrote: > This rationale seems a bit inaccurate. How about adding this to the comment > below (see the next two nits about how to reorder the code/comments here for > this): > > "Since the underlying hostname may be punycoded, convert to Unicode if possible, > so the keyword will look like the Unicode hostname the user is used to seeing." Done. But I worded it like this: // Since keyword is generated from hostname, it might be IDN-encoded. // If so, decode it and convert to Unicode using user's accepted languages, // making it look like Unicode hostname the user is used to seeing.
LGTM https://codereview.chromium.org/1238683003/diff/80001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/80001/components/search_engin... components/search_engines/template_url.cc:1221: // making it look like Unicode hostname the user is used to seeing. Grammar issues. Try this: |url|'s hostname may be IDN-encoded. Before generating |keyword| from it, convert to Unicode using the user's accept-languages, so it won't look like a confusing punycode string.
lgtm
https://codereview.chromium.org/1238683003/diff/80001/components/search_engin... File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/80001/components/search_engin... components/search_engines/template_url.cc:1221: // making it look like Unicode hostname the user is used to seeing. On 2015/07/21 17:34:09, Peter Kasting wrote: > Grammar issues. Try this: > > |url|'s hostname may be IDN-encoded. Before generating |keyword| from it, > convert to Unicode using the user's accept-languages, so it won't look like a > confusing punycode string. Done.
The CQ bit was checked by alshabalin@yandex-team.ru
The patchset sent to the CQ was uploaded after l-g-t-m from stevet@chromium.org, mgiuca@chromium.org, jochen@chromium.org, pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1238683003/#ps100001 (title: "Fix grammar.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1238683003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1238683003/100001
The author alshabalin@yandex-team.ru has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign and manage CLA.
The author alshabalin@yandex-team.ru has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign and manage CLA.
The author alshabalin@yandex-team.ru has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign and manage CLA.
The author alshabalin@yandex-team.ru has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign and manage CLA.
Message was sent while issue was closed.
Committed patchset #6 (id:100001)
Message was sent while issue was closed.
Patchset 6 (id:??) landed as https://crrev.com/e6212ac6b82fe6e2d73cd1e112b016fa45343927 Cr-Commit-Position: refs/heads/master@{#340053} |