Issue 2132103002: Split CJK full names into name parts correctly.

nicolaso

Description was changed from ========== Split CJK full names into name parts correctly. This a ...

4 years, 5 months ago (2016-07-08 18:08:59 UTC) #1

nicolaso

nicolaso@google.com changed reviewers: + mathp@chromium.org, sebsg@chromium.org

4 years, 5 months ago (2016-07-08 18:08:59 UTC) #2

Mathieu

Looks good, one question https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/browser/autofill_data_util.cc#newcode130 components/autofill/core/browser/autofill_data_util.cc:130: // Returns true if |name| ...

4 years, 5 months ago (2016-07-11 13:49:15 UTC) #4

nicolaso

On 2016/07/11 13:49:15, Mathieu Perreault wrote: > Looks good, one question > > https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/browser/autofill_data_util.cc > ...

4 years, 5 months ago (2016-07-11 15:28:46 UTC) #5

nicolaso

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/browser/autofill_data_util.cc#newcode130 components/autofill/core/browser/autofill_data_util.cc:130: // Returns true if |name| looks like a Korean ...

4 years, 5 months ago (2016-07-11 15:28:54 UTC) #6

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/br...
File components/autofill/core/browser/autofill_data_util.cc (right):

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/br...
components/autofill/core/browser/autofill_data_util.cc:130: // Returns true if
|name| looks like a Korean name, in Hangul characters.
On 2016/07/11 13:49:15, Mathieu Perreault wrote:
> "in Hangul characters" -> "made up entirely of Hangul characters or spaces"?

Done.

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/br...
components/autofill/core/browser/autofill_data_util.cc:151: // FIXME(89111):
Japanese names with no space will be mis-split, since we
On 2016/07/11 13:49:15, Mathieu Perreault wrote:
> TODO(crbug.com/89111)

Done.

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/br...
components/autofill/core/browser/autofill_data_util.cc:211: // TODO(89111):
Hungarian, Tamil, Telugu, and Vietnamese also have the
On 2016/07/11 13:49:15, Mathieu Perreault wrote:
> format is TODO(crbug.com/89111)

Done.

https://codereview.chromium.org/2132103002/diff/1/components/autofill/core/br...
components/autofill/core/browser/autofill_data_util.cc:215: if (IsCJKName(name)
&& SplitCJKName(name_tokens, &parts)) {
On 2016/07/11 13:49:15, Mathieu Perreault wrote:
> Are there cases where a name is made up of CJK characters of different
scripts?
> If not, I'm thinking IsCJKName could instead return the script, which could be
> used in SplitCJKName (would avoid going over the name twice). 
> 
> Let me know if such a thing could be possible, thanks

Off the top of my head, the only one I can think of, is someone whose name is
like "Bruce Takeshi". Japanese uses one script for Japanese names, and another
for foreign names. "Bruce" and "Takeshi" *could* be in different scripts.

As far as performance is concerned (since we re-traverse the name twice, with
`IsHangulName()'), I don't think we need to worry too much. Most CJK names are
made up of about 3 characters. A quick benchmark also says
`uscript_getScript(c)' is even faster than looking up `c' in an
`std::unordered_map', meaning that `IsCJK(c)' and `IsHangul(c)' are pretty fast
as well.

Mathieu

lgtm, thanks https://codereview.chromium.org/2132103002/diff/20001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/20001/components/autofill/core/browser/autofill_data_util.cc#newcode157 components/autofill/core/browser/autofill_data_util.cc:157: // TODO(crbug.com/89111): Japanese names with no space ...

4 years, 5 months ago (2016-07-12 14:39:00 UTC) #7

sebsg

It LGTM for me but I'd wait to see if someone from a Korean office ...

4 years, 5 months ago (2016-07-12 18:13:36 UTC) #8

nicolaso

https://codereview.chromium.org/2132103002/diff/20001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/20001/components/autofill/core/browser/autofill_data_util.cc#newcode157 components/autofill/core/browser/autofill_data_util.cc:157: // TODO(crbug.com/89111): Japanese names with no space will be ...

4 years, 5 months ago (2016-07-12 18:46:08 UTC) #9

nicolaso

nicolaso@google.com changed reviewers: + gogerald@chromium.org

4 years, 5 months ago (2016-07-12 19:07:14 UTC) #10

Jinsuk Kim

jinsukkim@chromium.org changed reviewers: + jinsukkim@chromium.org

4 years, 5 months ago (2016-07-12 22:50:47 UTC) #11

Jinsuk Kim

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc#newcode38 components/autofill/core/browser/autofill_data_util.cc:38: // https://namu.wiki/w/%ED%95%9C%EA%B5%AD%EC%9D%98%20%EC%84%B1%EC%94%A8#s-6 Consider changing the reference to the article ...

4 years, 5 months ago (2016-07-12 22:50:48 UTC) #12

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
File components/autofill/core/browser/autofill_data_util.cc (right):

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:38: //
https://namu.wiki/w/%ED%95%9C%EA%B5%AD%EC%9D%98%20%EC%84%B1%EC%94%A8#s-6
Consider changing the reference to the article in wikipedia.
https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B5%AD%EC%9D%98_%EC%84%B1%EC%94%A8_...
Namu Wiki, though getting more and more popular, allows lots of geeky,
controversial, cult-like contents only available through unofficial channels.
wikipedia is a more trustworthy source.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:168: // Korean full names
always have at least 3 characters. So, if there are
This is not entirely true. Names with 1-char surname + 1-char given name (1/1
for short) are also quite common. You need to check if the first letter is one
of the surnames in the reference. It is not always possible to tell 1/1 from 0/2
(given name only) though. But this is worth checking if the input is supposed to
contain both sur-/given name combo.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:175: // 4-character
Korean full names default to a 2-character surname. It's
You should still check if the first 2-character word is one of the surnames in
|cjk_multi_char_surnames|. There are people with 3-character (or longer) given
name. Korean surname is a well-defined list as shown in the reference(i.e. you
cannot invent your own surname and use it). The first 2 char is not a surname if
it is not in the list.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:186: if
(base::StartsWith(name, surname, base::CompareCase::SENSITIVE)) {
This is more complicated because the name is more likely to be 1/2 than 2/1
statistics wise. I suggest the input be categorized in following cases:

A) 1/2: Checking the first character against the list of 1-char surname list
will give much better result. A name like "강전희(Kang Jeon Hee)" (an old friend of
mine) has the first 2 letters that can be a surname in |cjk_multi_char_surnames|
but actually her surname is only 'Kang' which is one of the most common surnames
in Korea, and her given name is 'Jeon Hee'. Same goes for "동방", "소봉", "어금",
"장곡", and "황목" whose first letter is a surname more common that whole 2-letter
one. In fact I've never heard/met people with these 2-char surnames. They are
quite obscure ones.
https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B5%AD%EC%9D%98_%EB%B3%B5%EC%84%B1
shows that there are about 70 families of the name "어금". Likewise, There are
only 70 동방, 50 소봉, 17 장곡, 12 강전 families, etc.

B) 2/1: "남궁", "사공", "서문", "선우", "제갈" also have the first char that can be a
surname but the whole 2-char one is much more likely to be a right surname.

C) Easy decision 2/1: "독고", "망절" don't need such checking since their first
letter cannot be a surname by itself.

And I'm not really certain about "황보" - it can be either A) or B). I'd suggest
B).

Note that this is a subjective rule, but I don't think it's very biased.

In conclusion, the surname to suggest for each input are:

A) "강전", "동방", "소봉", "어금", "장곡", "황목" : first char only: i.e., "강", "동", "소",
"어", "장", "황"
B) "남궁", "사공", "서문", "선우", "제갈", "황보" : the whole 2 chars
C) "독고", "망절" : ditto

nicolaso

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc#newcode38 components/autofill/core/browser/autofill_data_util.cc:38: // https://namu.wiki/w/%ED%95%9C%EA%B5%AD%EC%9D%98%20%EC%84%B1%EC%94%A8#s-6 On 2016/07/12 22:50:48, Jinsuk wrote: > Consider ...

4 years, 5 months ago (2016-07-13 16:31:06 UTC) #13

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
File components/autofill/core/browser/autofill_data_util.cc (right):

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:38: //
https://namu.wiki/w/%ED%95%9C%EA%B5%AD%EC%9D%98%20%EC%84%B1%EC%94%A8#s-6
On 2016/07/12 22:50:48, Jinsuk wrote:
> Consider changing the reference to the article in wikipedia.
>
https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B5%AD%EC%9D%98_%EC%84%B1%EC%94%A8_...
> Namu Wiki, though getting more and more popular, allows lots of geeky,
> controversial, cult-like contents only available through unofficial channels.
> wikipedia is a more trustworthy source.

Done.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:168: // Korean full names
always have at least 3 characters. So, if there are
On 2016/07/12 22:50:48, Jinsuk wrote:
> This is not entirely true. Names with 1-char surname + 1-char given name (1/1
> for short) are also quite common. You need to check if the first letter is one
> of the surnames in the reference. It is not always possible to tell 1/1 from
0/2
> (given name only) though. But this is worth checking if the input is supposed
to
> contain both sur-/given name combo.

Should we check that the first  character is a very common surname? (like 김, 이,
최, 박, ...) If it is, it's more likely to be 1/1 than 0/2. If it's not a common
surname, then we could use 0/2.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:175: // 4-character
Korean full names default to a 2-character surname. It's
On 2016/07/12 22:50:48, Jinsuk wrote:
> You should still check if the first 2-character word is one of the surnames in
> |cjk_multi_char_surnames|. There are people with 3-character (or longer) given
> name. Korean surname is a well-defined list as shown in the reference(i.e. you
> cannot invent your own surname and use it). The first 2 char is not a surname
if
> it is not in the list.

I used two separate lists, |common_cjk_multi_char_surnames| and
|korean_multi_char_surnames|. The first one only has the common ones (marked as
B and C in your other comment), and the second one includes every one from the
list, even rare ones like 강전 and 장곡.

Even though 1/2 is more likely than 2/1 for a 3-character name, it might not be
true for 2/2 vs. 1/3. This is why I used |korean_multi_char_surnames| here. I
could be completely wrong. If I am, we can just remove this `if' (and the
|korean_multi_char_surnames| declaration) completely.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:186: if
(base::StartsWith(name, surname, base::CompareCase::SENSITIVE)) {
On 2016/07/12 22:50:48, Jinsuk wrote:
> This is more complicated because the name is more likely to be 1/2 than 2/1
> statistics wise. I suggest the input be categorized in following cases:
> 
> A) 1/2: Checking the first character against the list of 1-char surname list
> will give much better result. A name like "강전희(Kang Jeon Hee)" (an old friend
of
> mine) has the first 2 letters that can be a surname in
|cjk_multi_char_surnames|
> but actually her surname is only 'Kang' which is one of the most common
surnames
> in Korea, and her given name is 'Jeon Hee'. Same goes for "동방", "소봉", "어금",
> "장곡", and "황목" whose first letter is a surname more common that whole 2-letter
> one. In fact I've never heard/met people with these 2-char surnames. They are
> quite obscure ones.
> https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B5%AD%EC%9D%98_%EB%B3%B5%EC%84%B1
> shows that there are about 70 families of the name "어금". Likewise, There are
> only 70 동방, 50 소봉, 17 장곡, 12 강전 families, etc.
> 
> B) 2/1: "남궁", "사공", "서문", "선우", "제갈" also have the first char that can be a
> surname but the whole 2-char one is much more likely to be a right surname. 
> 
> C) Easy decision 2/1: "독고", "망절" don't need such checking since their first
> letter cannot be a surname by itself.
> 
> And I'm not really certain about "황보" - it can be either A) or B). I'd suggest
> B).
> 
> Note that this is a subjective rule, but I don't think it's very biased.
> 
> In conclusion, the surname to suggest for each input are:
> 
> A) "강전", "동방", "소봉", "어금", "장곡", "황목" : first char only: i.e., "강", "동", "소",
> "어", "장", "황"
> B) "남궁", "사공", "서문", "선우", "제갈", "황보" : the whole 2 chars
> C) "독고", "망절" : ditto
> 

I renamed |cjk_multi_char_surnames| to |common_cjk_multi_char_surnames|, and
removed the surnames you listed in A). Also added 강전희 as a test case.

Jinsuk Kim

Thanks for the update. Looks great. https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/40001/components/autofill/core/browser/autofill_data_util.cc#newcode168 components/autofill/core/browser/autofill_data_util.cc:168: // Korean full ...

4 years, 5 months ago (2016-07-13 21:51:09 UTC) #14

Thanks for the update. Looks great.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
File components/autofill/core/browser/autofill_data_util.cc (right):

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:168: // Korean full names
always have at least 3 characters. So, if there are
On 2016/07/13 16:31:06, nicolaso wrote:
> On 2016/07/12 22:50:48, Jinsuk wrote:
> > This is not entirely true. Names with 1-char surname + 1-char given name
(1/1
> > for short) are also quite common. You need to check if the first letter is
one
> > of the surnames in the reference. It is not always possible to tell 1/1 from
> 0/2
> > (given name only) though. But this is worth checking if the input is
supposed
> to
> > contain both sur-/given name combo.
> 
> Should we check that the first  character is a very common surname? (like 김,
이,
> 최, 박, ...) If it is, it's more likely to be 1/1 than 0/2. If it's not a common
> surname, then we could use 0/2.

I wonder if it is safe to assume that people in most cases put their full name
(with or without space) for autofill setting - isn't it what autofill expects?
If it is, we can just go with 1/1 split, leaving a wrong input (only family name
or nickname in the name field) leading to a wrong output. 

If not, I agree on using only common names (like top 10) for checking. No point
using all of them - it would still give lots of false positives even with them
all.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:175: // 4-character
Korean full names default to a 2-character surname. It's
On 2016/07/13 16:31:06, nicolaso wrote:
> On 2016/07/12 22:50:48, Jinsuk wrote:
> > You should still check if the first 2-character word is one of the surnames
in
> > |cjk_multi_char_surnames|. There are people with 3-character (or longer)
given
> > name. Korean surname is a well-defined list as shown in the reference(i.e.
you
> > cannot invent your own surname and use it). The first 2 char is not a
surname
> if
> > it is not in the list.
> 
> I used two separate lists, |common_cjk_multi_char_surnames| and
> |korean_multi_char_surnames|. The first one only has the common ones (marked
as
> B and C in your other comment), and the second one includes every one from the
> list, even rare ones like 강전 and 장곡.
> 
> Even though 1/2 is more likely than 2/1 for a 3-character name, it might not
be
> true for 2/2 vs. 1/3. This is why I used |korean_multi_char_surnames| here. I
> could be completely wrong. If I am, we can just remove this `if' (and the
> |korean_multi_char_surnames| declaration) completely.

Acknowledged. Looks good.

https://codereview.chromium.org/2132103002/diff/40001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:186: if
(base::StartsWith(name, surname, base::CompareCase::SENSITIVE)) {
On 2016/07/13 16:31:06, nicolaso wrote:
> On 2016/07/12 22:50:48, Jinsuk wrote:
> > This is more complicated because the name is more likely to be 1/2 than 2/1
> > statistics wise. I suggest the input be categorized in following cases:
> > 
> > A) 1/2: Checking the first character against the list of 1-char surname list
> > will give much better result. A name like "강전희(Kang Jeon Hee)" (an old
friend
> of
> > mine) has the first 2 letters that can be a surname in
> |cjk_multi_char_surnames|
> > but actually her surname is only 'Kang' which is one of the most common
> surnames
> > in Korea, and her given name is 'Jeon Hee'. Same goes for "동방", "소봉", "어금",
> > "장곡", and "황목" whose first letter is a surname more common that whole
2-letter
> > one. In fact I've never heard/met people with these 2-char surnames. They
are
> > quite obscure ones.
> > https://ko.wikipedia.org/wiki/%ED%95%9C%EA%B5%AD%EC%9D%98_%EB%B3%B5%EC%84%B1
> > shows that there are about 70 families of the name "어금". Likewise, There are
> > only 70 동방, 50 소봉, 17 장곡, 12 강전 families, etc.
> > 
> > B) 2/1: "남궁", "사공", "서문", "선우", "제갈" also have the first char that can be a
> > surname but the whole 2-char one is much more likely to be a right surname. 
> > 
> > C) Easy decision 2/1: "독고", "망절" don't need such checking since their first
> > letter cannot be a surname by itself.
> > 
> > And I'm not really certain about "황보" - it can be either A) or B). I'd
suggest
> > B).
> > 
> > Note that this is a subjective rule, but I don't think it's very biased.
> > 
> > In conclusion, the surname to suggest for each input are:
> > 
> > A) "강전", "동방", "소봉", "어금", "장곡", "황목" : first char only: i.e., "강", "동",
"소",
> > "어", "장", "황"
> > B) "남궁", "사공", "서문", "선우", "제갈", "황보" : the whole 2 chars
> > C) "독고", "망절" : ditto
> > 
> 
> I renamed |cjk_multi_char_surnames| to |common_cjk_multi_char_surnames|, and
> removed the surnames you listed in A). Also added 강전희 as a test case.

Looks good. How about adding a comment on the test case "강전희". It tests a
different flow in your logic. comment in the sense of "..choose 강 over 강전 since
the former is much more common".

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/cor...
File components/autofill/core/browser/autofill_data_util.cc (right):

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util.cc:35: // The most common
CJK surnames (last names) that have more than one character.
In fact it contains two groups. How about updating the comment "s/common/common
or non-ambiguous/" so other people reading the code won't stop and stare at
"독고"/"망절"

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/cor...
File components/autofill/core/browser/autofill_data_util_unittest.cc (right):

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/cor...
components/autofill/core/browser/autofill_data_util_unittest.cc:56: // Korean
full names always have at least 3 characters, so a 2-character
Please update the comment here.

nicolaso

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc#newcode35 components/autofill/core/browser/autofill_data_util.cc:35: // The most common CJK surnames (last names) that ...

4 years, 5 months ago (2016-07-14 18:12:43 UTC) #15

gogerald1

lgtm with comments https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc#newcode214 components/autofill/core/browser/autofill_data_util.cc:214: } else if (name_tokens.size() == 2) ...

4 years, 5 months ago (2016-07-14 18:28:52 UTC) #16

nicolaso

https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc File components/autofill/core/browser/autofill_data_util.cc (right): https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc#newcode214 components/autofill/core/browser/autofill_data_util.cc:214: } else if (name_tokens.size() == 2) { On 2016/07/14 ...

4 years, 5 months ago (2016-07-14 20:32:17 UTC) #17

Jinsuk Kim

On 2016/07/14 18:12:43, nicolaso wrote: > https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc > File components/autofill/core/browser/autofill_data_util.cc (right): > > https://codereview.chromium.org/2132103002/diff/60001/components/autofill/core/browser/autofill_data_util.cc#newcode35 > ...

4 years, 5 months ago (2016-07-14 21:35:44 UTC) #18

nicolaso

The patchset sent to the CQ was uploaded after l-g-t-m from mathp@chromium.org, sebsg@chromium.org, gogerald@chromium.org Link ...

4 years, 5 months ago (2016-07-18 14:18:51 UTC) #20