Issue 1238683003: Unpunycode search keywords and short names.

alshabalin

alshabalin@yandex-team.ru changed reviewers: + jochen@chromium.org, stevenjb@chromium.org, stevet@chromium.org

5 years, 5 months ago (2015-07-14 11:50:26 UTC) #1

SteveT

LGTM for search engines - provided that there is another reviewer who understands the Punycode ...

5 years, 5 months ago (2015-07-14 13:33:31 UTC) #3

jochen (gone - plz use gerrit)

jochen@chromium.org changed reviewers: + felt@chromium.org, pkasting@chromium.org

5 years, 5 months ago (2015-07-14 14:35:58 UTC) #4

jochen (gone - plz use gerrit)

+pkasting because omnibox affected +felt because this might be security ux relevant

5 years, 5 months ago (2015-07-14 14:35:59 UTC) #5

Peter Kasting

Is there an associated BUG= here? https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode124 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short ...

5 years, 5 months ago (2015-07-14 18:05:33 UTC) #6

Is there an associated BUG= here?

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
Why do we need to do this for these particular cases, but we don't go so far as
to, say, run IDNToUnicode() unconditionally inside SetShortName() and
SetKeyword()?

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:125:
data.SetShortName(net::IDNToUnicode(parameters->name, ""));
Nit: Use std::string() in place of "" everywhere in this change.

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url.cc:1221: base::string16
keyword(net::StripWWW(net::IDNToUnicode(url.host(), "")));
I think we should pass the user's accept-languages here, so we'll un-punycode
more cases that the user should understand.

This should probably be passed to this function as a second arg, and so on up
the call chain until you get to a chrome-side caller that can compute this as:

  profile_->GetPrefs()->GetString(prefs::kAcceptLanguages)

Another route in some cases would be to add an embedder client method to get
these languages, then have the components-side code that needs to call the
client method get the embedder to instantiate such a client instance for it. 
Yet another route would be to add this information, or accessors that compute
it, to SearchTermsData.

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url_parser.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url_parser.cc:257:
base::IsStringASCII(context->string_)
Why do an ASCII check here?  Why not just unconditionally go the IDNToUnicode()
route as you have elsewhere?

If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII check
and early-out itself?  Or maybe we need to insert these sorts of conditionals in
other places too?

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url_unittest.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url_unittest.cc:1708:
ASSERT_EQ(base::UTF8ToUTF16("абв"),
Don't put Unicode characters directly in source files; use \x escapes or
similar.

alshabalin

On 2015/07/14 18:05:33, Peter Kasting wrote: > Is there an associated BUG= here? Couldn't find ...

5 years, 5 months ago (2015-07-15 14:17:35 UTC) #7

alshabalin

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode124 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/14 18:05:33, ...

5 years, 5 months ago (2015-07-15 14:18:01 UTC) #8

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
On 2015/07/14 18:05:33, Peter Kasting wrote:
> Why do we need to do this for these particular cases, but we don't go so far
as
> to, say, run IDNToUnicode() unconditionally inside SetShortName() and
> SetKeyword()?

Because I want to perform IDN conversion on user input only. For JS API here and
for site-supplied OpenSearch data (in TemplateURLParser) I defensively run this
conversion to catch possible mistakes from the site maintainer: e.g. when site
returns OpenSearch xml with ShortName in punycode (probably auto-generated from
the hostname), I want the end user to see it in the decoded form.

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url.cc:1221: base::string16
keyword(net::StripWWW(net::IDNToUnicode(url.host(), "")));
On 2015/07/14 18:05:33, Peter Kasting wrote:
> I think we should pass the user's accept-languages here, so we'll un-punycode
> more cases that the user should understand.
> 
> This should probably be passed to this function as a second arg, and so on up
> the call chain until you get to a chrome-side caller that can compute this as:
> 
>   profile_->GetPrefs()->GetString(prefs::kAcceptLanguages)
> 
> Another route in some cases would be to add an embedder client method to get
> these languages, then have the components-side code that needs to call the
> client method get the embedder to instantiate such a client instance for it. 
> Yet another route would be to add this information, or accessors that compute
> it, to SearchTermsData.

My thought was that a string in a language user does not read and a string in
IDN form are equally incomprehensible but the first one looks nicer. After all,
punycode is there only because hostname must be ASCII and as such is an
implementation detail.

Or is there some guideline that we should unpunycode only to the languages
the user understands?

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url_parser.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url_parser.cc:257:
base::IsStringASCII(context->string_)
On 2015/07/14 18:05:33, Peter Kasting wrote:
> Why do an ASCII check here?  Why not just unconditionally go the
IDNToUnicode()
> route as you have elsewhere?
> 
> If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII
check
> and early-out itself?  Or maybe we need to insert these sorts of conditionals
in
> other places too?

IDNToUnicode requires ASCII because it calls ASCIIToUTF16 on its argument. Here
it's the data from OpenSearch xml which could contain unicode. In other places
it is guaranteed to be in ASCII: in JS API it previously called ASCIIToUTF16, in
TemplateURL::GenerateKeyword it uses hostname.

But it might indeed make sense to move the ASCII check to IDNToUnicode.

alshabalin

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode125 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:125: data.SetShortName(net::IDNToUnicode(parameters->name, "")); On 2015/07/14 18:05:33, Peter Kasting wrote: > ...

5 years, 5 months ago (2015-07-15 15:19:50 UTC) #9

felt

On 2015/07/15 14:17:35, alshabalin wrote: > On 2015/07/14 18:05:33, Peter Kasting wrote: > > Is ...

5 years, 5 months ago (2015-07-15 16:16:41 UTC) #10

Peter Kasting

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode124 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/15 at ...

5 years, 5 months ago (2015-07-15 21:19:12 UTC) #11

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
On 2015/07/15 at 14:18:01, alshabalin wrote:
> On 2015/07/14 18:05:33, Peter Kasting wrote:
> > Why do we need to do this for these particular cases, but we don't go so far
as
> > to, say, run IDNToUnicode() unconditionally inside SetShortName() and
> > SetKeyword()?
> 
> Because I want to perform IDN conversion on user input only. For JS API here
and
> for site-supplied OpenSearch data (in TemplateURLParser) I defensively run
this
> conversion to catch possible mistakes from the site maintainer: e.g. when site
> returns OpenSearch xml with ShortName in punycode (probably auto-generated
from
> the hostname), I want the end user to see it in the decoded form.

I don't understand your paragraph.  It seems to first say you want to only
un-punycode user input, then shortly after that you want to un-punycode
non-user-input (e.g. stuff in the OSDD).

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url.cc:1221: base::string16
keyword(net::StripWWW(net::IDNToUnicode(url.host(), "")));
On 2015/07/15 at 14:18:01, alshabalin wrote:
> On 2015/07/14 18:05:33, Peter Kasting wrote:
> > I think we should pass the user's accept-languages here, so we'll
un-punycode
> > more cases that the user should understand.
> > 
> > This should probably be passed to this function as a second arg, and so on
up
> > the call chain until you get to a chrome-side caller that can compute this
as:
> > 
> >   profile_->GetPrefs()->GetString(prefs::kAcceptLanguages)
> > 
> > Another route in some cases would be to add an embedder client method to get
> > these languages, then have the components-side code that needs to call the
> > client method get the embedder to instantiate such a client instance for it.

> > Yet another route would be to add this information, or accessors that
compute
> > it, to SearchTermsData.
> 
> My thought was that a string in a language user does not read and a string in
> IDN form are equally incomprehensible but the first one looks nicer. After
all,
> punycode is there only because hostname must be ASCII and as such is an
> implementation detail.
> 
> Or is there some guideline that we should unpunycode only to the languages
> the user understands?

It seems like you're misunderstanding how the conversion works.  If you pass no
languages, we will fail to render as Unicode any hostnames that mix characters
from different scripts.  If you pass the languages the user understands, then
we'll be willing to show as Unicode hostnames which use characters from those
scripts.  Passing the empty string doesn't mean "un-punycode everything", it
means something closer to the opposite.

In other words, I'm trying to get your code to un-punycode more hostnames, not
fewer.

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url_parser.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url_parser.cc:257:
base::IsStringASCII(context->string_)
On 2015/07/15 at 14:18:01, alshabalin wrote:
> On 2015/07/14 18:05:33, Peter Kasting wrote:
> > Why do an ASCII check here?  Why not just unconditionally go the
IDNToUnicode()
> > route as you have elsewhere?
> > 
> > If there's an efficiency issue, perhaps IDNToUnicode() should do an ASCII
check
> > and early-out itself?  Or maybe we need to insert these sorts of
conditionals in
> > other places too?
> 
> IDNToUnicode requires ASCII because it calls ASCIIToUTF16 on its argument.
Here
> it's the data from OpenSearch xml which could contain unicode. In other places
> it is guaranteed to be in ASCII: in JS API it previously called ASCIIToUTF16,
in
> TemplateURL::GenerateKeyword it uses hostname.
> 
> But it might indeed make sense to move the ASCII check to IDNToUnicode.

At the least you need a comment in all these places noting why you do or do not
need an ASCII check.

alshabalin

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode124 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/15 21:19:12, ...

5 years, 5 months ago (2015-07-16 15:13:07 UTC) #12

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
On 2015/07/15 21:19:12, Peter Kasting wrote:
> On 2015/07/15 at 14:18:01, alshabalin wrote:
> > On 2015/07/14 18:05:33, Peter Kasting wrote:
> > > Why do we need to do this for these particular cases, but we don't go so
far
> as
> > > to, say, run IDNToUnicode() unconditionally inside SetShortName() and
> > > SetKeyword()?
> > 
> > Because I want to perform IDN conversion on user input only. For JS API here
> and
> > for site-supplied OpenSearch data (in TemplateURLParser) I defensively run
> this
> > conversion to catch possible mistakes from the site maintainer: e.g. when
site
> > returns OpenSearch xml with ShortName in punycode (probably auto-generated
> from
> > the hostname), I want the end user to see it in the decoded form.
> 
> I don't understand your paragraph.  It seems to first say you want to only
> un-punycode user input, then shortly after that you want to un-punycode
> non-user-input (e.g. stuff in the OSDD).

I was thinking server-side user at the time and actually meant input from OSDD.
It is confusing, sorry. Anyway, my point is to leave the decision whether to
un-punycode or not to the caller because un-punycoding ShortName is not
a normal behaviour but a workaround to fix accidental mistakes for the end-user
benefit. I can't come up with an example when doing the conversion would hurt
though.

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url.cc:1221: base::string16
keyword(net::StripWWW(net::IDNToUnicode(url.host(), "")));
On 2015/07/15 21:19:12, Peter Kasting wrote:
> On 2015/07/15 at 14:18:01, alshabalin wrote:
> > On 2015/07/14 18:05:33, Peter Kasting wrote:
> > > I think we should pass the user's accept-languages here, so we'll
> un-punycode
> > > more cases that the user should understand.
> > > 
> > > This should probably be passed to this function as a second arg, and so on
> up
> > > the call chain until you get to a chrome-side caller that can compute this
> as:
> > > 
> > >   profile_->GetPrefs()->GetString(prefs::kAcceptLanguages)
> > > 
> > > Another route in some cases would be to add an embedder client method to
get
> > > these languages, then have the components-side code that needs to call the
> > > client method get the embedder to instantiate such a client instance for
it.
> 
> > > Yet another route would be to add this information, or accessors that
> compute
> > > it, to SearchTermsData.
> > 
> > My thought was that a string in a language user does not read and a string
in
> > IDN form are equally incomprehensible but the first one looks nicer. After
> all,
> > punycode is there only because hostname must be ASCII and as such is an
> > implementation detail.
> > 
> > Or is there some guideline that we should unpunycode only to the languages
> > the user understands?
> 
> It seems like you're misunderstanding how the conversion works.  If you pass
no
> languages, we will fail to render as Unicode any hostnames that mix characters
> from different scripts.  If you pass the languages the user understands, then
> we'll be willing to show as Unicode hostnames which use characters from those
> scripts.  Passing the empty string doesn't mean "un-punycode everything", it
> means something closer to the opposite.
> 
> In other words, I'm trying to get your code to un-punycode more hostnames, not
> fewer.

I thought script-mixing in hostnames is not allowed anyway because it opens up
more ways to spoof. But apparently I'm wrong
https://tools.ietf.org/html/rfc5890#section-4.4
My mistake, sorry, will rewrite to pass accepted languages then.

alshabalin

On 2015/07/15 16:16:41, felt wrote: > On 2015/07/15 14:17:35, alshabalin wrote: > > On 2015/07/14 ...

5 years, 5 months ago (2015-07-16 15:14:05 UTC) #13

felt

felt@chromium.org changed reviewers: + mgiuca@chromium.org - felt@chromium.org

5 years, 5 months ago (2015-07-16 15:17:01 UTC) #14

Peter Kasting

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc File chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc (right): https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc#newcode124 chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124: // Unpunycode short name and keyword. On 2015/07/16 at ...

5 years, 5 months ago (2015-07-16 19:16:27 UTC) #17

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
On 2015/07/16 at 15:13:07, alshabalin wrote:
> On 2015/07/15 21:19:12, Peter Kasting wrote:
> > On 2015/07/15 at 14:18:01, alshabalin wrote:
> > > On 2015/07/14 18:05:33, Peter Kasting wrote:
> > > > Why do we need to do this for these particular cases, but we don't go so
far
> > as
> > > > to, say, run IDNToUnicode() unconditionally inside SetShortName() and
> > > > SetKeyword()?
> > > 
> > > Because I want to perform IDN conversion on user input only. For JS API
here
> > and
> > > for site-supplied OpenSearch data (in TemplateURLParser) I defensively run
> > this
> > > conversion to catch possible mistakes from the site maintainer: e.g. when
site
> > > returns OpenSearch xml with ShortName in punycode (probably auto-generated
> > from
> > > the hostname), I want the end user to see it in the decoded form.
> > 
> > I don't understand your paragraph.  It seems to first say you want to only
> > un-punycode user input, then shortly after that you want to un-punycode
> > non-user-input (e.g. stuff in the OSDD).
> 
> I was thinking server-side user at the time and actually meant input from
OSDD.
> It is confusing, sorry. Anyway, my point is to leave the decision whether to
> un-punycode or not to the caller because un-punycoding ShortName is not
> a normal behaviour but a workaround to fix accidental mistakes for the
end-user
> benefit. I can't come up with an example when doing the conversion would hurt
> though.

Well, it's not normal in the sense that we don't currently do it, but I'm not
convinced it's ever wrong.  I'm inclined to suggest centralizing the conversion
so it just always happens.  However, that might make getting the correct
languages tricky, so it may be that the current solution is easier to implement.

Matt Giuca

Hi, thanks for including me, felt@. I have a problem with *some* of the decodings ...

5 years, 5 months ago (2015-07-17 04:13:11 UTC) #18

Hi, thanks for including me, felt@.

I have a problem with *some* of the decodings in this CL, as indicated in my
comments. This CL attempts to change three conversions:

1. In the searchEnginePrivate.addOtherSearchEngine private API function,
IDN-decode the "name" and "keyword" arguments before storing them in the search
engines database. (I believe this API is used by the Settings page when the user
adds or alters a search engine.)
2. When a search engine is automatically generated from a domain name,
IDN-decode the domain name before storing it as the short name / keyword in the
search engines database.
3. When parsing an OpenSearch description document (XML), IDN-decode the
ShortName field before storing it in the search engines database.

As I've noted in the comments, I believe #2 is a good change, but #1 and #3 are
invalid. If you are converting a domain name into a human-readable string (as in
#2), then IDN-decoding is a necessary step. But #1 and #3 are not processing
domain names, they are dealing with ordinary strings specifically intended for
use as a search engine short name / keyword, so why treat them as a domain name?

It looks like the case in the bug report (http://crbug.com/510832) of
"http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem
comes from auto-generating the keyword from a domain name. That will be fixed by
#2. Any site that has an OpenSearch XML file should not need fixing because the
ShortName field should not be IDN-encoded (and if it is, it is a problem with
the site, not something we should be fixing on the user agent side).

Please see my detailed comments.

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/20001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:124:
// Unpunycode short name and keyword.
It sounds like you're saying that because site maintainers might make the
mistake of putting Punycode in the XML ShortName field, we should pro-actively
decode it for them...?

That sounds like a Bad Thing. The OpenSearch spec [1] does not say that this
should be Punycode-encoded, or that it is a domain name. It simply says that
it's "a brief human-readable title".

You could make a similar argument for decoding percent-escaped text in this
field, or performing any other type of unescaping, "just in case" some badly
formatted data was accidentally encoded. But that's not how user agents should
deal with data.

(Conversely: do you have any evidence of high-profile sites that have a
Punycoded domain in their ShortName field?)

[1]
http://www.opensearch.org/Specifications/OpenSearch/1.1#The_.22ShortName.22_e...

https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:127:
data.SetShortName(net::IDNToUnicode(parameters->name, std::string()));
This should not be unpunycoded. The name and keyword parameters are not domains
and should not be expected to be Punycode-encoded (they are being passed in
through an API).

If the rationale for this change is that users might type in an IDN-encoded
domain into the "keyword" or "short name" field, and that will get passed
through this API, and we want to decode those, I would say:
a) that is not our concern. The user typed/pasted a string into a text field,
and we should respect that string. There is no reason to decode the string in
this text field any more than if they typed/pasted it into any other text field.
b) if we did want to decode that user input, it should be done at the user
interface level, not at the API level. You should do the decoding in the
settings page UI, not here.

https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:162:
turl, base::UTF8ToUTF16(parameters->name),
If you're going to change addOtherSearchEngine, you also need to change this
behaviour (updateSearchEngine).

(But I'm telling you to revert the above change, so you shouldn't have to do
anything here.)

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url.cc:1220: // We do not want any punycode
in any language.
This comment should be more detailed and should refer to IDN, not Punycode. How
about:
"The URL host may be an internationalized domain name (IDN). If so, decode it
into a regular Unicode string."

You should not be using the term "Punycode" here. Punycode is a sub-step in the
IDN encoding/decoding algorithm. Roughly, the IDN encoding algorithm is this:
1. Split the domain name by '.' character.
2. For any segment that contains non-ASCII characters, Punycode-encode it and
prepend "xn--".
3. Join the segments by '.' characters.

So what you are doing here is not Punycode-decoding, but IDN-decoding.

Also I'd like to see it separated from the above comment about stripping "www".
Maybe break this out like so:

base::string16 keyword = url.host();
// <comment about decoding IDN>
keyword = net::IDNToUnicode(keyword, std::string());
// <comment about stripping "www">
keyword = net::StripWWW(keyword);

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url_fetcher_unittest.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:335: std::string
osdd_file_name("punycode_open_search.xml");
This test is not necessary since we do not expect ShortNames to be IDN-encoded.
(And if they are, we should not do anything special in that case.)

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:344:
TEST_F(TemplateURLFetcherTest, UnicodeTest) {
This is a good test! I'm kind of surprised that none of the other tests in this
file actually check *any* fields of the TemplateURL object they get back from
GetTemplateURLForKeyword. If you're interested, you could go to the above tests
and add some assertions to that effect.

(i.e., this is the very first assertion in this file for t-url->short_name().)

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:356:
EXPECT_EQ(keyword, t_url->short_name());
It's not clear why you expect short_name to equal keyword (other than the
coincidence that the keyword and short name are the same string). For a more
robust test, can you change the keyword to be a different string to the short
name, and introduce a new string "expected_short_name" for this test?

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url_parser.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_parser.cc:258:
base::IsStringASCII(context->string_)
This should not be unpunycoded. The ShortName field is not a domain and should
not be expected to be Punycode-encoded.

https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se...
File components/test/data/search_engines/punycode_open_search.xml (right):

https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se...
components/test/data/search_engines/punycode_open_search.xml:1: <?xml
version="1.0"?>
Delete this file, as discussed above.

Peter Kasting

I'm probably not as convinced of the invalidity of aggressive un-punycoding as mgiuca is, but ...

5 years, 5 months ago (2015-07-17 05:12:54 UTC) #19

alshabalin

On 2015/07/17 04:13:11, Matt Giuca wrote: > Hi, thanks for including me, felt@. > > ...

5 years, 5 months ago (2015-07-17 09:39:57 UTC) #20

On 2015/07/17 04:13:11, Matt Giuca wrote:
> Hi, thanks for including me, felt@.
> 
> I have a problem with *some* of the decodings in this CL, as indicated in my
> comments. This CL attempts to change three conversions:
> 
> 1. In the searchEnginePrivate.addOtherSearchEngine private API function,
> IDN-decode the "name" and "keyword" arguments before storing them in the
search
> engines database. (I believe this API is used by the Settings page when the
user
> adds or alters a search engine.)
> 2. When a search engine is automatically generated from a domain name,
> IDN-decode the domain name before storing it as the short name / keyword in
the
> search engines database.
> 3. When parsing an OpenSearch description document (XML), IDN-decode the
> ShortName field before storing it in the search engines database.
> 
> As I've noted in the comments, I believe #2 is a good change, but #1 and #3
are
> invalid. If you are converting a domain name into a human-readable string (as
in
> #2), then IDN-decoding is a necessary step. But #1 and #3 are not processing
> domain names, they are dealing with ordinary strings specifically intended for
> use as a search engine short name / keyword, so why treat them as a domain
name?

About #1: My initial thought was that this API was to become a way for sites to
interactively add themselves as search engines. But I see I'm wrong
http://crbug.com/480043 Ok, will remove.

> It looks like the case in the bug report (http://crbug.com/510832) of
> "http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem
> comes from auto-generating the keyword from a domain name. That will be fixed
by
> #2. Any site that has an OpenSearch XML file should not need fixing because
the
> ShortName field should not be IDN-encoded (and if it is, it is a problem with
> the site, not something we should be fixing on the user agent side).

I have but one example: the other site mentioned in the bug report
"http://новостиворонежа.рф" falls into #3 category. It delivers OpenSearch XML
roughly via this url
http://yandex.ru/sitesearch/opensearch.xml?ShortName=xn--80adamfj7adfbcazkp.x...
Notice that it erroneously asks it to set ShortName in IDN-encoding instead of
URL-encoding.
This site however is not high profile which is probably why this has gone
unnoticed.
The decision to do IDN-conversion client-side anyway was made because it doesn't
seem to complicate code much and doesn't hurt existing functionality while
(slightly)
improving end-user experience. If you still want it gone, I will remove it.

Matt Giuca

On 2015/07/17 09:39:57, alshabalin wrote: > > About #1: My initial thought was that this ...

5 years, 5 months ago (2015-07-20 01:49:03 UTC) #22

On 2015/07/17 09:39:57, alshabalin wrote:
> 
> About #1: My initial thought was that this API was to become a way for sites
to
> interactively add themselves as search engines. But I see I'm wrong
> http://crbug.com/480043 Ok, will remove.

Even if it were an API for sites to interactively add themselves, we would want
them to be passing normal strings to that API, not IDN-encoded strings. So we
wouldn't want this here either way.

> > It looks like the case in the bug report (http://crbug.com/510832) of
> > "http://центрсвязи.рф" does not have an OpenSearch XML file, so the problem
> > comes from auto-generating the keyword from a domain name. That will be
fixed
> by
> > #2. Any site that has an OpenSearch XML file should not need fixing because
> the
> > ShortName field should not be IDN-encoded (and if it is, it is a problem
with
> > the site, not something we should be fixing on the user agent side).
> 
> I have but one example: the other site mentioned in the bug report
> "http://новостиворонежа.рф" falls into #3 category. It delivers OpenSearch XML
> roughly via this url
>
http://yandex.ru/sitesearch/opensearch.xml?ShortName=xn--80adamfj7adfbcazkp.x...

OK, I see that it does that. (It seems to do it client-side, because the "view
source" version of the page doesn't have the link, but the DOM tree does.)

> Notice that it erroneously asks it to set ShortName in IDN-encoding instead of
> URL-encoding.
> This site however is not high profile which is probably why this has gone
> unnoticed.
> The decision to do IDN-conversion client-side anyway was made because it
doesn't
> seem to complicate code much and doesn't hurt existing functionality while
> (slightly)
> improving end-user experience.

We can't add new browser functionality (no matter how simple the code is)
because one site (or a handful of sites) are doing the wrong thing. That
fragments the web. It means we're giving those sites our blessing to continue
doing the wrong thing, at a cost of compatibility with other browsers. Instead,
those sites should change their behaviour to match the existing specification.

For analogy: suppose you found a site (or three sites) that had an IDN-encoded
<title> element, which was causing a user-visible unreadable ASCII string to
appear in the browser's title bar, instead of the site's title. Should we change
the parser to always attempt to IDN-decode all <title> elements on all HTML
pages, to "fix" this problem? No, because this moves us away from the standard
and from the behaviour of other browsers, and makes the problem worse (as we may
encourage more sites to IDN-encode their titles). Instead, those sites should
fix their broken HTML. The OpenSearch XML is the same issue, but less visible.

On the other hand, if you can show that there is a widespread practice of sites
IDN-encoding the ShortName field, then there is a case to make this change, but
it should probably be done in consultation with the OpenSearch standard authors
and other browser vendors (i.e. not in this CL).

> If you still want it gone, I will remove it.

Yes, please.

alshabalin

https://codereview.chromium.org/1238683003/diff/20001/components/search_engines/template_url.cc File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/20001/components/search_engines/template_url.cc#newcode1221 components/search_engines/template_url.cc:1221: base::string16 keyword(net::StripWWW(net::IDNToUnicode(url.host(), ""))); On 2015/07/15 21:19:12, Peter Kasting wrote: ...

5 years, 5 months ago (2015-07-20 14:06:38 UTC) #23

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/20001/components/search_engin...
components/search_engines/template_url.cc:1221: base::string16
keyword(net::StripWWW(net::IDNToUnicode(url.host(), "")));
On 2015/07/15 21:19:12, Peter Kasting wrote:
> On 2015/07/15 at 14:18:01, alshabalin wrote:
> > On 2015/07/14 18:05:33, Peter Kasting wrote:
> > > I think we should pass the user's accept-languages here, so we'll
> un-punycode
> > > more cases that the user should understand.
> > > 
> > > This should probably be passed to this function as a second arg, and so on
> up
> > > the call chain until you get to a chrome-side caller that can compute this
> as:
> > > 
> > >   profile_->GetPrefs()->GetString(prefs::kAcceptLanguages)
> > > 
> > > Another route in some cases would be to add an embedder client method to
get
> > > these languages, then have the components-side code that needs to call the
> > > client method get the embedder to instantiate such a client instance for
it.
> 
> > > Yet another route would be to add this information, or accessors that
> compute
> > > it, to SearchTermsData.
> > 
> > My thought was that a string in a language user does not read and a string
in
> > IDN form are equally incomprehensible but the first one looks nicer. After
> all,
> > punycode is there only because hostname must be ASCII and as such is an
> > implementation detail.
> > 
> > Or is there some guideline that we should unpunycode only to the languages
> > the user understands?
> 
> It seems like you're misunderstanding how the conversion works.  If you pass
no
> languages, we will fail to render as Unicode any hostnames that mix characters
> from different scripts.  If you pass the languages the user understands, then
> we'll be willing to show as Unicode hostnames which use characters from those
> scripts.  Passing the empty string doesn't mean "un-punycode everything", it
> means something closer to the opposite.
> 
> In other words, I'm trying to get your code to un-punycode more hostnames, not
> fewer.

Done.

https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio...
File
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc
(right):

https://codereview.chromium.org/1238683003/diff/40001/chrome/browser/extensio...
chrome/browser/extensions/api/search_engines_private/search_engines_private_api.cc:127:
data.SetShortName(net::IDNToUnicode(parameters->name, std::string()));
On 2015/07/17 04:13:11, Matt Giuca wrote:
> This should not be unpunycoded. The name and keyword parameters are not
domains
> and should not be expected to be Punycode-encoded (they are being passed in
> through an API).
> 
> If the rationale for this change is that users might type in an IDN-encoded
> domain into the "keyword" or "short name" field, and that will get passed
> through this API, and we want to decode those, I would say:
> a) that is not our concern. The user typed/pasted a string into a text field,
> and we should respect that string. There is no reason to decode the string in
> this text field any more than if they typed/pasted it into any other text
field.
> b) if we did want to decode that user input, it should be done at the user
> interface level, not at the API level. You should do the decoding in the
> settings page UI, not here.

Done.

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url.cc:1220: // We do not want any punycode
in any language.
On 2015/07/17 04:13:11, Matt Giuca wrote:
> This comment should be more detailed and should refer to IDN, not Punycode.
How
> about:
> "The URL host may be an internationalized domain name (IDN). If so, decode it
> into a regular Unicode string."
> 
> You should not be using the term "Punycode" here. Punycode is a sub-step in
the
> IDN encoding/decoding algorithm. Roughly, the IDN encoding algorithm is this:
> 1. Split the domain name by '.' character.
> 2. For any segment that contains non-ASCII characters, Punycode-encode it and
> prepend "xn--".
> 3. Join the segments by '.' characters.
> 
> So what you are doing here is not Punycode-decoding, but IDN-decoding.
> 
> Also I'd like to see it separated from the above comment about stripping
"www".
> Maybe break this out like so:
> 
> base::string16 keyword = url.host();
> // <comment about decoding IDN>
> keyword = net::IDNToUnicode(keyword, std::string());
> // <comment about stripping "www">
> keyword = net::StripWWW(keyword);

Done.

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url_fetcher_unittest.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:335: std::string
osdd_file_name("punycode_open_search.xml");
On 2015/07/17 04:13:11, Matt Giuca wrote:
> This test is not necessary since we do not expect ShortNames to be
IDN-encoded.
> (And if they are, we should not do anything special in that case.)

Done.

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:344:
TEST_F(TemplateURLFetcherTest, UnicodeTest) {
On 2015/07/17 04:13:11, Matt Giuca wrote:
> This is a good test! I'm kind of surprised that none of the other tests in
this
> file actually check *any* fields of the TemplateURL object they get back from
> GetTemplateURLForKeyword. If you're interested, you could go to the above
tests
> and add some assertions to that effect.
> 
> (i.e., this is the very first assertion in this file for t-url->short_name().)

Added a short_name() assertion to BasicAutodectedTest.

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_fetcher_unittest.cc:356:
EXPECT_EQ(keyword, t_url->short_name());
On 2015/07/17 04:13:11, Matt Giuca wrote:
> It's not clear why you expect short_name to equal keyword (other than the
> coincidence that the keyword and short name are the same string). For a more
> robust test, can you change the keyword to be a different string to the short
> name, and introduce a new string "expected_short_name" for this test?

Done. (Although I didn't introduce expected_short_name, making expectation
base::UTF8ToUTF16 of a string literal)

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
File components/search_engines/template_url_parser.cc (right):

https://codereview.chromium.org/1238683003/diff/40001/components/search_engin...
components/search_engines/template_url_parser.cc:258:
base::IsStringASCII(context->string_)
On 2015/07/17 04:13:11, Matt Giuca wrote:
> This should not be unpunycoded. The ShortName field is not a domain and should
> not be expected to be Punycode-encoded.

Done.

https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se...
File components/test/data/search_engines/punycode_open_search.xml (right):

https://codereview.chromium.org/1238683003/diff/40001/components/test/data/se...
components/test/data/search_engines/punycode_open_search.xml:1: <?xml
version="1.0"?>
On 2015/07/17 04:13:11, Matt Giuca wrote:
> Delete this file, as discussed above.

Done.

Peter Kasting

Miguel, in fairness, are other browsers really a factor here? Chrome is more aggressive than ...

5 years, 5 months ago (2015-07-20 16:47:17 UTC) #24

Peter Kasting

On 2015/07/20 at 16:47:17, Peter Kasting wrote: > Miguel, Er, Matt. I don't know why ...

5 years, 5 months ago (2015-07-20 16:48:13 UTC) #25

stevenjb

stevenjb@chromium.org changed reviewers: - stevenjb@chromium.org

5 years, 5 months ago (2015-07-20 17:54:53 UTC) #26

Peter Kasting

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_engines/ui_thread_search_terms_data.cc File chrome/browser/search_engines/ui_thread_search_terms_data.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_engines/ui_thread_search_terms_data.cc#newcode188 chrome/browser/search_engines/ui_thread_search_terms_data.cc:188: return profile_->GetPrefs()->GetString(prefs::kAcceptLanguages); Nit: Shorter: return profile_ ? profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) : ...

5 years, 5 months ago (2015-07-20 19:30:52 UTC) #27

Matt Giuca

On 2015/07/20 16:47:17, Peter Kasting wrote: > Miguel, in fairness, are other browsers really a ...

5 years, 5 months ago (2015-07-21 00:49:39 UTC) #28

alshabalin

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_engines/ui_thread_search_terms_data.cc File chrome/browser/search_engines/ui_thread_search_terms_data.cc (right): https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_engines/ui_thread_search_terms_data.cc#newcode188 chrome/browser/search_engines/ui_thread_search_terms_data.cc:188: return profile_->GetPrefs()->GetString(prefs::kAcceptLanguages); On 2015/07/20 19:30:51, Peter Kasting wrote: > ...

5 years, 5 months ago (2015-07-21 08:44:19 UTC) #29

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e...
File chrome/browser/search_engines/ui_thread_search_terms_data.cc (right):

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e...
chrome/browser/search_engines/ui_thread_search_terms_data.cc:188: return
profile_->GetPrefs()->GetString(prefs::kAcceptLanguages);
On 2015/07/20 19:30:51, Peter Kasting wrote:
> Nit: Shorter:
> 
>   return profile_ ?
>       profile_->GetPrefs()->GetString(prefs::kAcceptLanguages) :
std::string();

Done.

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e...
File chrome/browser/search_engines/ui_thread_search_terms_data.h (right):

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/search_e...
chrome/browser/search_engines/ui_thread_search_terms_data.h:36: std::string
GetAcceptedLanguages() const override;
On 2015/07/20 19:30:51, Peter Kasting wrote:
> Call this GetAcceptLanguages(), not "accepted", because "accept-languages" is
> the meaningful phrase in question, and we use that terminology elsewhere.  (I
> know grammatically this is weird.)

Done.

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc...
File chrome/browser/ui/search_engines/search_engine_tab_helper.cc (right):

https://codereview.chromium.org/1238683003/diff/60001/chrome/browser/ui/searc...
chrome/browser/ui/search_engines/search_engine_tab_helper.cc:150:
TemplateURLServiceFactory::GetForProfile(profile)) {
On 2015/07/20 19:30:51, Peter Kasting wrote:
> Don't put a statement with side effects in a conditional.
> 
> In this case, if you have the profile anyway, get the accept-languages from
its
> prefs directly; don't go through the TemplateURLService.

Done.

Although I somewhat disagree. In my brain accept_languages for
TemplateURL::GenerateKeyword
is not an input bun an environment. And (again, in my mind) environment for
TemplateURL should
lie with TemplateURLService.

https://codereview.chromium.org/1238683003/diff/60001/components/search_engin...
File components/search_engines/search_terms_data.h (right):

https://codereview.chromium.org/1238683003/diff/60001/components/search_engin...
components/search_engines/search_terms_data.h:78: // Proper implementation
requires access to profile preferences
On 2015/07/20 19:30:51, Peter Kasting wrote:
> Nit: For this second sentence, just copy the wording of the last sentence for
> NTPIsThemedParam() above.

Done.

https://codereview.chromium.org/1238683003/diff/60001/components/search_engin...
File components/search_engines/template_url.cc (right):

https://codereview.chromium.org/1238683003/diff/60001/components/search_engin...
components/search_engines/template_url.cc:1218: // should be IDN-decoded using
user's accepted languages.
On 2015/07/20 19:30:52, Peter Kasting wrote:
> This rationale seems a bit inaccurate.  How about adding this to the comment
> below (see the next two nits about how to reorder the code/comments here for
> this):
> 
> "Since the underlying hostname may be punycoded, convert to Unicode if
possible,
> so the keyword will look like the Unicode hostname the user is used to
seeing."

Done.

But I worded it like this:
  // Since keyword is generated from hostname, it might be IDN-encoded.
  // If so, decode it and convert to Unicode using user's accepted languages,
  // making it look like Unicode hostname the user is used to seeing.

Peter Kasting

LGTM https://codereview.chromium.org/1238683003/diff/80001/components/search_engines/template_url.cc File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/80001/components/search_engines/template_url.cc#newcode1221 components/search_engines/template_url.cc:1221: // making it look like Unicode hostname the ...

5 years, 5 months ago (2015-07-21 17:34:09 UTC) #30

alshabalin

https://codereview.chromium.org/1238683003/diff/80001/components/search_engines/template_url.cc File components/search_engines/template_url.cc (right): https://codereview.chromium.org/1238683003/diff/80001/components/search_engines/template_url.cc#newcode1221 components/search_engines/template_url.cc:1221: // making it look like Unicode hostname the user ...

5 years, 5 months ago (2015-07-22 14:31:17 UTC) #32

alshabalin

The patchset sent to the CQ was uploaded after l-g-t-m from stevet@chromium.org, mgiuca@chromium.org, jochen@chromium.org, pkasting@chromium.org ...

5 years, 5 months ago (2015-07-23 06:39:30 UTC) #34