Issue 149449: Fix: Linux file:// listings are sorted but not fully internationalized...

Issue 149449: Fix: Linux file:// listings are sorted but not fully internationalized... (Closed)

Created:
11 years, 5 months ago by Yuzo

Modified:
9 years, 7 months ago

Reviewers:
TVL, jungshik at Google

CC:
chromium-reviews_googlegroups.com, John Grabowski, brettw

Base URL:
svn://chrome-svn.corp.google.com/chrome/trunk/src/

Visibility:
Public.

Description

Fix: Linux file:// listings are sorted but not fully internationalized Changed the code such that file names are converted to UTF-16 before comparison by a Collator. BUG=16179 TEST=Open a file:// URL for a directory that contains files with international names. Observe that files are sorted properly in the list. Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=20706

Patch Set 1 #

Patch Set 2 : '' #

Total comments: 7

Patch Set 3 : '' #

Total comments: 8

Patch Set 4 : '' #

Total comments: 8

Patch Set 5 : '' #

Created: 11 years, 5 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+67 lines, -12 lines)			Patch
	M	base/file_util_posix.cc	View	1 2 3 4	3 chunks	+67 lines, -12 lines	0 comments	Download

Messages

Total messages: 8 (0 generated)

Expand Messages | Collapse Messages

TVL

driveby http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/6/1001#newcode587 Line 587: // On linux, the file system encoding ...

11 years, 5 months ago (2009-07-10 12:18:27 UTC) #2

jungshik at Google

Thank you for catching this !! See my comments below. http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/6/1001#newcode587 ...

11 years, 5 months ago (2009-07-10 18:24:44 UTC) #3

Thank you for catching this !!

See my comments below.

http://codereview.chromium.org/149449/diff/6/1001
File base/file_util_posix.cc (right):

http://codereview.chromium.org/149449/diff/6/1001#newcode587
Line 587: // On linux, the file system encoding is not defined. We assume
On 2009/07/10 12:18:27, TVL wrote:
> fyi - this file is posix, so it's not always linux (ie-macosx).  not sure if
> that changes what you'd want to do here.

On MacOS X, the encoding is well-defined (UTF-8 + NFD) as you're well aware of
:-). SysNativeMBToWide works even better, there :-)
(except that we may have to tweak a bit with NFD part)

http://codereview.chromium.org/149449/diff/6/1001#newcode592
Line 592: // proves to be false.
base already depends on ICU. So, we should ICU's collator (then, NFD etc will be
taken care of as well). Perhaps, this part has to be pulled out of
file_util_posix and put into file_util_icu.cc.  Hmm, Windows does not have a
corresponding function (it seems that it relies on the OS file enumerator that
will sort based on the OS locale, which can be different from Chrom's UI
locale).  

We init'd the ICU locale to match the Chrome's UI locale. So, you can just use
ICU's collator without specifying a locale. (hmm, this is base/. So, we may want
to offer a way to set the locale explicitly. When called from Chrome, we can set
the locale to GetApplicationLocale()). 

For ICU's collator, see http://icu-project.org/apiref/icu4c/classCollator.html

Slightly off-topic, chrome/app/l10n_util has string sorting functions using ICU.
Later, we may have to move it to base.

http://codereview.chromium.org/149449/diff/6/1001#newcode594
Line 594: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name)));
Even if we do this, you don't need to convert it to UTF-16. This is for POSIX
(Mac and Linux) and their wstring is UTF-32 and can be compared by code-point.
On Mac, you don't even have to convert to Wide because it's UTF-8 (NFD). Of
course, the code-point comparision is not what we want here.

So, you have to use a real collator as I wrote above.

Yuzo

Thank you for your reviews. Can you take another look? Yuzo http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): ...

11 years, 5 months ago (2009-07-13 07:45:11 UTC) #4

Thank you for your reviews.

Can you take another look?

Yuzo

http://codereview.chromium.org/149449/diff/6/1001
File base/file_util_posix.cc (right):

http://codereview.chromium.org/149449/diff/6/1001#newcode587
Line 587: // On linux, the file system encoding is not defined. We assume
TVL, Jungshik,

Thank you for your comments.
My understanding is the description is fine as-is.

Please let me know otherwise.

On 2009/07/10 18:24:44, Jungshik Shin wrote:
> On 2009/07/10 12:18:27, TVL wrote:
> > fyi - this file is posix, so it's not always linux (ie-macosx).  not sure if
> > that changes what you'd want to do here.
> 
> On MacOS X, the encoding is well-defined (UTF-8 + NFD) as you're well aware of
> :-). SysNativeMBToWide works even better, there :-)
> (except that we may have to tweak a bit with NFD part)

http://codereview.chromium.org/149449/diff/6/1001#newcode592
Line 592: // proves to be false.
Thank you for the detailed comment.

I changed the code to use Collator.

On 2009/07/10 18:24:44, Jungshik Shin wrote:
> base already depends on ICU. So, we should ICU's collator (then, NFD etc will
be
> taken care of as well). Perhaps, this part has to be pulled out of
> file_util_posix and put into file_util_icu.cc.  Hmm, Windows does not have a
> corresponding function (it seems that it relies on the OS file enumerator that
> will sort based on the OS locale, which can be different from Chrom's UI
> locale).  
> 
> 
> 
> We init'd the ICU locale to match the Chrome's UI locale. So, you can just use
> ICU's collator without specifying a locale. (hmm, this is base/. So, we may
want
> to offer a way to set the locale explicitly. When called from Chrome, we can
set
> the locale to GetApplicationLocale()). 
> 
> For ICU's collator, see http://icu-project.org/apiref/icu4c/classCollator.html
> 
> 
> Slightly off-topic, chrome/app/l10n_util has string sorting functions using
ICU.
> Later, we may have to move it to base.

http://codereview.chromium.org/149449/diff/6/1001#newcode594
Line 594: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name)));
On 2009/07/10 18:24:44, Jungshik Shin wrote:
> Even if we do this, you don't need to convert it to UTF-16. This is for POSIX
> (Mac and Linux) and their wstring is UTF-32 and can be compared by code-point.
> On Mac, you don't even have to convert to Wide because it's UTF-8 (NFD). Of
> course, the code-point comparision is not what we want here.
> 
> So, you have to use a real collator as I wrote above. 

Yes. Done.

jungshik at Google

Thank you for addressing my comment. Let's just go over once more. http://codereview.chromium.org/149449/diff/11/1004 File base/file_util_posix.cc ...

11 years, 5 months ago (2009-07-13 19:14:23 UTC) #5

Thank you for addressing my comment. Let's just go over once more.

http://codereview.chromium.org/149449/diff/11/1004
File base/file_util_posix.cc (right):

http://codereview.chromium.org/149449/diff/11/1004#newcode56
Line 56: const Collator* GetCollator() {
Would it be better to use Singleton (base/singleton.h)? Either way seems fine to
me except that with the current code, |collator| is leaked at the program
termination, which shouldn't matter much, I guess.

http://codereview.chromium.org/149449/diff/11/1004#newcode60
Line 60: Collator* nonconst_collator = Collator::createInstance(error_code);
Can you add a comment that we don't pass a locale because we assume that the
locale is already set appropriately by the time this is called (so that using
the default locale works)?

http://codereview.chromium.org/149449/diff/11/1004#newcode62
Line 62: nonconst_collator->setStrength(Collator::TERTIARY);
For Japanese, which do you think it's better to sort Hiragana and Katakana,
separately or equivalently? I guess it's better to sort them equivalently. In an
unlikely case they need to be sorted separately, you have to check if the
current locale is ja_* and set the strength to QUATERNARY.

Locale loc=nonconst_collator->getLocale(ULOC_VALID_LOCALE, error_code);
if (!strncmp(loc.getName(), "ja", 2))
  set the strength to QUATERNARY

http://codereview.chromium.org/149449/diff/11/1004#newcode93
Line 93: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name)));
In theory, you can just pass fts_name directly to ICU collator's compare()
because it'll treat it as the os native codepage string and convert to UTF-16. 
One of compare() accepts two UnicodeString and one of UnicodeString ctor's
accepts a null-terminated string and converts to UTF-16 internally as if it's in
'the OS native encoding'.

This should just work for Mac OS X. 

A potential problem on Linux is that nl_langinfo(CODESET) depends on the value
of LC_CTYPE and ICU's setDefaultLocale (sp?) may or may not pick up the correct
value. And, even if it does, there may be difference between ICU's converter and
glibc. So, let's just use SysNativeMB* for now although the double conversions
(SysMB->Wide->UTF16) looks rather ugly.

Can you add a brief(?) comment about this?

Yuzo

Hi, Jungshik, Thank you very much for the detailed review -- I learned a lot. ...

11 years, 5 months ago (2009-07-14 03:56:56 UTC) #6

Hi, Jungshik,

Thank you very much for the detailed review -- I learned a lot. :)
Can you take another look?

Yuzo

http://codereview.chromium.org/149449/diff/11/1004
File base/file_util_posix.cc (right):

http://codereview.chromium.org/149449/diff/11/1004#newcode56
Line 56: const Collator* GetCollator() {
On 2009/07/13 19:14:23, Jungshik Shin wrote:
> Would it be better to use Singleton (base/singleton.h)? Either way seems fine
to
> me except that with the current code, |collator| is leaked at the program
> termination, which shouldn't matter much, I guess.
>

Yeah, Singleton should be better. Changed.

http://codereview.chromium.org/149449/diff/11/1004#newcode60
Line 60: Collator* nonconst_collator = Collator::createInstance(error_code);
On 2009/07/13 19:14:23, Jungshik Shin wrote:
> Can you add a comment that we don't pass a locale because we assume that the
> locale is already set appropriately by the time this is called (so that using
> the default locale works)? 

Done.

http://codereview.chromium.org/149449/diff/11/1004#newcode62
Line 62: nonconst_collator->setStrength(Collator::TERTIARY);
On 2009/07/13 19:14:23, Jungshik Shin wrote:
> For Japanese, which do you think it's better to sort Hiragana and Katakana,
> separately or equivalently? I guess it's better to sort them equivalently. In
an
> unlikely case they need to be sorted separately, you have to check if the
> current locale is ja_* and set the strength to QUATERNARY.
> 
> Locale loc=nonconst_collator->getLocale(ULOC_VALID_LOCALE, error_code);
> if (!strncmp(loc.getName(), "ja", 2))
>   set the strength to QUATERNARY
> 
> 

Thank you for the detailed review.
I think TERTIARY is best also for Japanese.

http://codereview.chromium.org/149449/diff/11/1004#newcode93
Line 93: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name)));
On 2009/07/13 19:14:23, Jungshik Shin wrote:
> In theory, you can just pass fts_name directly to ICU collator's compare()
> because it'll treat it as the os native codepage string and convert to UTF-16.

> One of compare() accepts two UnicodeString and one of UnicodeString ctor's
> accepts a null-terminated string and converts to UTF-16 internally as if it's
in
> 'the OS native encoding'.
> 
> This should just work for Mac OS X. 
> 
> A potential problem on Linux is that nl_langinfo(CODESET) depends on the value
> of LC_CTYPE and ICU's setDefaultLocale (sp?) may or may not pick up the
correct
> value. And, even if it does, there may be difference between ICU's converter
and
> glibc. So, let's just use SysNativeMB* for now although the double conversions
> (SysMB->Wide->UTF16) looks rather ugly.
> 
> Can you add a brief(?) comment about this? 
>  

Done.

jungshik at Google

LGTM !! Thank you for the fix ! http://codereview.chromium.org/149449/diff/1007/1008 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/1007/1008#newcode60 Line 60: ...

11 years, 5 months ago (2009-07-14 05:36:26 UTC) #7

Yuzo

11 years, 5 months ago (2009-07-14 06:27:52 UTC) #8

Jungshik,

Thank you for reviewing this. All comments have been addressed.
I'll submit this shortly.

Yuzo

http://codereview.chromium.org/149449/diff/1007/1008
File base/file_util_posix.cc (right):

http://codereview.chromium.org/149449/diff/1007/1008#newcode60
Line 60: collator_->setStrength(Collator::TERTIARY);
On 2009/07/14 05:36:27, Jungshik Shin wrote:
> Yet another concern on Linux I meant to mention in the previous comment but
> forgot is that filenames can be in a non-FCD form (
> http://unicode.org/notes/tn5/#FCD ). However, the change is very low and I
guess
> we don't have to turn the attribute for normalization on.  (
>
http://icu-project.org/apiref/icu4c/ucol_8h.html#583fbe7fc4a850e2fcc692e766d2...
> ). Another brief comment to that effect would be nice. Sorry for asking yet
> another comment :-)
> 

I've learned yet another bit of UNICODE. :)
Comment added.

http://codereview.chromium.org/149449/diff/1007/1008#newcode65
Line 65: // TODO(yuzo): Move all or some of l10n_util to base.
On 2009/07/14 05:36:27, Jungshik Shin wrote:
> nit: I think only some of l10n_utils have to be moved :-). So, "Move some of
> ...' :-)
> 

Changed.

http://codereview.chromium.org/149449/diff/1007/1008#newcode87
Line 87: DISALLOW_EVIL_CONSTRUCTORS(LocaleAwareComparator);
On 2009/07/14 05:36:27, Jungshik Shin wrote:
> nit: DISALLOW_COPY_AND_ASSIGN

Changed.

http://codereview.chromium.org/149449/diff/1007/1008#newcode102
Line 102: // TODO(yuzo): Perhaps we should define SysNativeMBToUTF16?
On 2009/07/14 05:36:27, Jungshik Shin wrote:
> yup. I put exactly the same TODO comment in another CL. :-)

I see.

Expand Messages | Collapse Messages