|
|
Chromium Code Reviews|
Created:
11 years, 5 months ago by Yuzo Modified:
9 years, 7 months ago CC:
chromium-reviews_googlegroups.com, John Grabowski, brettw Base URL:
svn://chrome-svn.corp.google.com/chrome/trunk/src/ Visibility:
Public. |
DescriptionFix: Linux file:// listings are sorted but not fully internationalized
Changed the code such that file names are converted to UTF-16 before
comparison by a Collator.
BUG=16179
TEST=Open a file:// URL for a directory that contains files with international names. Observe that files are sorted properly in the list.
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=20706
Patch Set 1 #Patch Set 2 : '' #
Total comments: 7
Patch Set 3 : '' #
Total comments: 8
Patch Set 4 : '' #
Total comments: 8
Patch Set 5 : '' #Messages
Total messages: 8 (0 generated)
Hi, Jungshik, Can you review this? Yuzo
driveby http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/6/1001#newcode587 Line 587: // On linux, the file system encoding is not defined. We assume fyi - this file is posix, so it's not always linux (ie-macosx). not sure if that changes what you'd want to do here.
Thank you for catching this !! See my comments below. http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/6/1001#newcode587 Line 587: // On linux, the file system encoding is not defined. We assume On 2009/07/10 12:18:27, TVL wrote: > fyi - this file is posix, so it's not always linux (ie-macosx). not sure if > that changes what you'd want to do here. On MacOS X, the encoding is well-defined (UTF-8 + NFD) as you're well aware of :-). SysNativeMBToWide works even better, there :-) (except that we may have to tweak a bit with NFD part) http://codereview.chromium.org/149449/diff/6/1001#newcode592 Line 592: // proves to be false. base already depends on ICU. So, we should ICU's collator (then, NFD etc will be taken care of as well). Perhaps, this part has to be pulled out of file_util_posix and put into file_util_icu.cc. Hmm, Windows does not have a corresponding function (it seems that it relies on the OS file enumerator that will sort based on the OS locale, which can be different from Chrom's UI locale). We init'd the ICU locale to match the Chrome's UI locale. So, you can just use ICU's collator without specifying a locale. (hmm, this is base/. So, we may want to offer a way to set the locale explicitly. When called from Chrome, we can set the locale to GetApplicationLocale()). For ICU's collator, see http://icu-project.org/apiref/icu4c/classCollator.html Slightly off-topic, chrome/app/l10n_util has string sorting functions using ICU. Later, we may have to move it to base. http://codereview.chromium.org/149449/diff/6/1001#newcode594 Line 594: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name))); Even if we do this, you don't need to convert it to UTF-16. This is for POSIX (Mac and Linux) and their wstring is UTF-32 and can be compared by code-point. On Mac, you don't even have to convert to Wide because it's UTF-8 (NFD). Of course, the code-point comparision is not what we want here. So, you have to use a real collator as I wrote above.
Thank you for your reviews. Can you take another look? Yuzo http://codereview.chromium.org/149449/diff/6/1001 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/6/1001#newcode587 Line 587: // On linux, the file system encoding is not defined. We assume TVL, Jungshik, Thank you for your comments. My understanding is the description is fine as-is. Please let me know otherwise. On 2009/07/10 18:24:44, Jungshik Shin wrote: > On 2009/07/10 12:18:27, TVL wrote: > > fyi - this file is posix, so it's not always linux (ie-macosx). not sure if > > that changes what you'd want to do here. > > On MacOS X, the encoding is well-defined (UTF-8 + NFD) as you're well aware of > :-). SysNativeMBToWide works even better, there :-) > (except that we may have to tweak a bit with NFD part) http://codereview.chromium.org/149449/diff/6/1001#newcode592 Line 592: // proves to be false. Thank you for the detailed comment. I changed the code to use Collator. On 2009/07/10 18:24:44, Jungshik Shin wrote: > base already depends on ICU. So, we should ICU's collator (then, NFD etc will be > taken care of as well). Perhaps, this part has to be pulled out of > file_util_posix and put into file_util_icu.cc. Hmm, Windows does not have a > corresponding function (it seems that it relies on the OS file enumerator that > will sort based on the OS locale, which can be different from Chrom's UI > locale). > > > > We init'd the ICU locale to match the Chrome's UI locale. So, you can just use > ICU's collator without specifying a locale. (hmm, this is base/. So, we may want > to offer a way to set the locale explicitly. When called from Chrome, we can set > the locale to GetApplicationLocale()). > > For ICU's collator, see http://icu-project.org/apiref/icu4c/classCollator.html > > > Slightly off-topic, chrome/app/l10n_util has string sorting functions using ICU. > Later, we may have to move it to base. http://codereview.chromium.org/149449/diff/6/1001#newcode594 Line 594: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name))); On 2009/07/10 18:24:44, Jungshik Shin wrote: > Even if we do this, you don't need to convert it to UTF-16. This is for POSIX > (Mac and Linux) and their wstring is UTF-32 and can be compared by code-point. > On Mac, you don't even have to convert to Wide because it's UTF-8 (NFD). Of > course, the code-point comparision is not what we want here. > > So, you have to use a real collator as I wrote above. Yes. Done.
Thank you for addressing my comment. Let's just go over once more. http://codereview.chromium.org/149449/diff/11/1004 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/11/1004#newcode56 Line 56: const Collator* GetCollator() { Would it be better to use Singleton (base/singleton.h)? Either way seems fine to me except that with the current code, |collator| is leaked at the program termination, which shouldn't matter much, I guess. http://codereview.chromium.org/149449/diff/11/1004#newcode60 Line 60: Collator* nonconst_collator = Collator::createInstance(error_code); Can you add a comment that we don't pass a locale because we assume that the locale is already set appropriately by the time this is called (so that using the default locale works)? http://codereview.chromium.org/149449/diff/11/1004#newcode62 Line 62: nonconst_collator->setStrength(Collator::TERTIARY); For Japanese, which do you think it's better to sort Hiragana and Katakana, separately or equivalently? I guess it's better to sort them equivalently. In an unlikely case they need to be sorted separately, you have to check if the current locale is ja_* and set the strength to QUATERNARY. Locale loc=nonconst_collator->getLocale(ULOC_VALID_LOCALE, error_code); if (!strncmp(loc.getName(), "ja", 2)) set the strength to QUATERNARY http://codereview.chromium.org/149449/diff/11/1004#newcode93 Line 93: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name))); In theory, you can just pass fts_name directly to ICU collator's compare() because it'll treat it as the os native codepage string and convert to UTF-16. One of compare() accepts two UnicodeString and one of UnicodeString ctor's accepts a null-terminated string and converts to UTF-16 internally as if it's in 'the OS native encoding'. This should just work for Mac OS X. A potential problem on Linux is that nl_langinfo(CODESET) depends on the value of LC_CTYPE and ICU's setDefaultLocale (sp?) may or may not pick up the correct value. And, even if it does, there may be difference between ICU's converter and glibc. So, let's just use SysNativeMB* for now although the double conversions (SysMB->Wide->UTF16) looks rather ugly. Can you add a brief(?) comment about this?
Hi, Jungshik, Thank you very much for the detailed review -- I learned a lot. :) Can you take another look? Yuzo http://codereview.chromium.org/149449/diff/11/1004 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/11/1004#newcode56 Line 56: const Collator* GetCollator() { On 2009/07/13 19:14:23, Jungshik Shin wrote: > Would it be better to use Singleton (base/singleton.h)? Either way seems fine to > me except that with the current code, |collator| is leaked at the program > termination, which shouldn't matter much, I guess. > Yeah, Singleton should be better. Changed. http://codereview.chromium.org/149449/diff/11/1004#newcode60 Line 60: Collator* nonconst_collator = Collator::createInstance(error_code); On 2009/07/13 19:14:23, Jungshik Shin wrote: > Can you add a comment that we don't pass a locale because we assume that the > locale is already set appropriately by the time this is called (so that using > the default locale works)? Done. http://codereview.chromium.org/149449/diff/11/1004#newcode62 Line 62: nonconst_collator->setStrength(Collator::TERTIARY); On 2009/07/13 19:14:23, Jungshik Shin wrote: > For Japanese, which do you think it's better to sort Hiragana and Katakana, > separately or equivalently? I guess it's better to sort them equivalently. In an > unlikely case they need to be sorted separately, you have to check if the > current locale is ja_* and set the strength to QUATERNARY. > > Locale loc=nonconst_collator->getLocale(ULOC_VALID_LOCALE, error_code); > if (!strncmp(loc.getName(), "ja", 2)) > set the strength to QUATERNARY > > Thank you for the detailed review. I think TERTIARY is best also for Japanese. http://codereview.chromium.org/149449/diff/11/1004#newcode93 Line 93: WideToUTF16(base::SysNativeMBToWide((*b)->fts_name))); On 2009/07/13 19:14:23, Jungshik Shin wrote: > In theory, you can just pass fts_name directly to ICU collator's compare() > because it'll treat it as the os native codepage string and convert to UTF-16. > One of compare() accepts two UnicodeString and one of UnicodeString ctor's > accepts a null-terminated string and converts to UTF-16 internally as if it's in > 'the OS native encoding'. > > This should just work for Mac OS X. > > A potential problem on Linux is that nl_langinfo(CODESET) depends on the value > of LC_CTYPE and ICU's setDefaultLocale (sp?) may or may not pick up the correct > value. And, even if it does, there may be difference between ICU's converter and > glibc. So, let's just use SysNativeMB* for now although the double conversions > (SysMB->Wide->UTF16) looks rather ugly. > > Can you add a brief(?) comment about this? > Done.
LGTM !! Thank you for the fix ! http://codereview.chromium.org/149449/diff/1007/1008 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/1007/1008#newcode60 Line 60: collator_->setStrength(Collator::TERTIARY); Yet another concern on Linux I meant to mention in the previous comment but forgot is that filenames can be in a non-FCD form ( http://unicode.org/notes/tn5/#FCD ). However, the change is very low and I guess we don't have to turn the attribute for normalization on. ( http://icu-project.org/apiref/icu4c/ucol_8h.html#583fbe7fc4a850e2fcc692e766d2... ). Another brief comment to that effect would be nice. Sorry for asking yet another comment :-) http://codereview.chromium.org/149449/diff/1007/1008#newcode65 Line 65: // TODO(yuzo): Move all or some of l10n_util to base. nit: I think only some of l10n_utils have to be moved :-). So, "Move some of ...' :-) http://codereview.chromium.org/149449/diff/1007/1008#newcode87 Line 87: DISALLOW_EVIL_CONSTRUCTORS(LocaleAwareComparator); nit: DISALLOW_COPY_AND_ASSIGN http://codereview.chromium.org/149449/diff/1007/1008#newcode102 Line 102: // TODO(yuzo): Perhaps we should define SysNativeMBToUTF16? yup. I put exactly the same TODO comment in another CL. :-)
Jungshik, Thank you for reviewing this. All comments have been addressed. I'll submit this shortly. Yuzo http://codereview.chromium.org/149449/diff/1007/1008 File base/file_util_posix.cc (right): http://codereview.chromium.org/149449/diff/1007/1008#newcode60 Line 60: collator_->setStrength(Collator::TERTIARY); On 2009/07/14 05:36:27, Jungshik Shin wrote: > Yet another concern on Linux I meant to mention in the previous comment but > forgot is that filenames can be in a non-FCD form ( > http://unicode.org/notes/tn5/#FCD ). However, the change is very low and I guess > we don't have to turn the attribute for normalization on. ( > http://icu-project.org/apiref/icu4c/ucol_8h.html#583fbe7fc4a850e2fcc692e766d2... > ). Another brief comment to that effect would be nice. Sorry for asking yet > another comment :-) > I've learned yet another bit of UNICODE. :) Comment added. http://codereview.chromium.org/149449/diff/1007/1008#newcode65 Line 65: // TODO(yuzo): Move all or some of l10n_util to base. On 2009/07/14 05:36:27, Jungshik Shin wrote: > nit: I think only some of l10n_utils have to be moved :-). So, "Move some of > ...' :-) > Changed. http://codereview.chromium.org/149449/diff/1007/1008#newcode87 Line 87: DISALLOW_EVIL_CONSTRUCTORS(LocaleAwareComparator); On 2009/07/14 05:36:27, Jungshik Shin wrote: > nit: DISALLOW_COPY_AND_ASSIGN Changed. http://codereview.chromium.org/149449/diff/1007/1008#newcode102 Line 102: // TODO(yuzo): Perhaps we should define SysNativeMBToUTF16? On 2009/07/14 05:36:27, Jungshik Shin wrote: > yup. I put exactly the same TODO comment in another CL. :-) I see. |
