|
|
Chromium Code Reviews|
Created:
5 years, 4 months ago by mcindy Modified:
5 years, 4 months ago Reviewers:
Andrew Hayden (chromium.org) CC:
chromium-reviews, chaotian1, amalika, smn, dsites_google.com, hajimehoshi, droger Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionImplement CLD hints to CLD2 calls. Edit CLD2 result to return top language instead of summary language.
BUG=517628
Committed: https://crrev.com/55614df31bd8645d20bd824a486305e9545eba09
Cr-Commit-Position: refs/heads/master@{#342232}
Patch Set 1 #Patch Set 2 : DetectLanguageSummaryV2 function call change #
Total comments: 2
Patch Set 3 : Fixed patch error #Patch Set 4 : Added inline comments on CLD params #Messages
Total messages: 20 (8 generated)
mcindy@google.com changed reviewers: + andrewhayden@chromium.org
Implement the use of HTML and Content language as CLD2 hint to the call of DetermineTextLanguage.
This looks good to me. Thank you for switching to using the top returned language, not the bogus summary language (which differs from the top language sometimes in order to match the Google production server LangId behavior; I regret having done this). /dick On Tue, Aug 4, 2015 at 9:25 AM, <mcindy@google.com> wrote: > Reviewers: Andrew Hayden (chromium.org), > > Message: > Implement the use of HTML and Content language as CLD2 hint to the call of > DetermineTextLanguage. > > Description: > Implement CLD hints to CLD2 calls. Edit CLD2 result to return top language > instead of summary language. > > BUG= > > Please review this at https://codereview.chromium.org/1263613002/ > > Base URL: https://chromium.googlesource.com/chromium/src.git@master > > Affected files (+39, -15 lines): > M components/translate/core/language_detection/language_detection_util.cc > > > Index: > components/translate/core/language_detection/language_detection_util.cc > diff --git > a/components/translate/core/language_detection/language_detection_util.cc > b/components/translate/core/language_detection/language_detection_util.cc > index > 5c751a7ffd59fb1427bb7597a306bfb2ba9a0d3c..1697fbed5ba4bd6a0c7b5dd54cc4ea747e11df5c > 100644 > --- > a/components/translate/core/language_detection/language_detection_util.cc > +++ > b/components/translate/core/language_detection/language_detection_util.cc > @@ -21,6 +21,7 @@ > > #if !defined(CLD_VERSION) || CLD_VERSION==2 > #include "third_party/cld_2/src/public/compact_lang_det.h" > +#include "third_party/cld_2/src/public/encodings.h" > #endif > > namespace { > @@ -86,7 +87,9 @@ int GetCLDMajorVersion() { > // failed. > // |is_cld_reliable| will be set as true if CLD says the detection is > reliable. > std::string DetermineTextLanguage(const base::string16& text, > - bool* is_cld_reliable) { > + bool* is_cld_reliable, > + std::string& code, > + std::string& html_lang) { > std::string language = translate::kUnknownLanguageCode; > int num_bytes_evaluated = 0; > bool is_reliable = false; > @@ -114,21 +117,41 @@ std::string DetermineTextLanguage(const > base::string16& text, > const std::string utf8_text(base::UTF16ToUTF8(text)); > const int num_utf8_bytes = static_cast<int>(utf8_text.size()); > const char* raw_utf8_bytes = utf8_text.c_str(); > - cld_language = CLD2::DetectLanguageCheckUTF8( > - raw_utf8_bytes, num_utf8_bytes, is_plain_text, &is_reliable, > - &num_bytes_evaluated); > + > + CLD2::Language language3[3]; > + int percent3[3]; > + int flags = 0; // No flags, see compact_lang_det.h for details. > + int text_bytes; // Amount of non-tag/letters-only text (assumed 0). > + double normalized_score3[3]; > + > + const char* tld_hint = ""; > + int encoding_hint = CLD2::UNKNOWN_ENCODING; > + CLD2::Language language_hint = > + CLD2::GetLanguageFromName(html_lang.c_str()); > + CLD2::CLDHints cldhints = {code.c_str(), tld_hint, encoding_hint, > + language_hint}; > + > + cld_language = CLD2::ExtDetectLanguageSummaryCheckUTF8( > + raw_utf8_bytes, num_utf8_bytes, is_plain_text, &cldhints, flags, > + language3, percent3, normalized_score3, nullptr, &text_bytes, > + &is_reliable, &num_bytes_evaluated); > > if (num_bytes_evaluated < num_utf8_bytes && > cld_language == CLD2::UNKNOWN_LANGUAGE) { > // Invalid UTF8 encountered, see bug http://crbug.com/444258. > // Retry using only the valid characters. This time the check for > valid > // UTF8 can be skipped since the precise number of valid bytes is > known. > - cld_language = CLD2::DetectLanguage(raw_utf8_bytes, > num_bytes_evaluated, > - is_plain_text, &is_reliable); > + cld_language = CLD2::ExtDetectLanguageSummary( > + raw_utf8_bytes, num_utf8_bytes, is_plain_text, &cldhints, > flags, > + language3, percent3, normalized_score3, nullptr, &text_bytes, > + &is_reliable); > } > is_valid_language = cld_language != CLD2::NUM_LANGUAGES && > cld_language != CLD2::UNKNOWN_LANGUAGE && > cld_language != CLD2::TG_UNKNOWN_LANGUAGE; > + > + // Choose top language. > + cld_language = language3[0]; > break; > } > #endif > @@ -213,15 +236,6 @@ std::string DeterminePageLanguage(const std::string& > code, > bool* is_cld_reliable_p) { > base::TimeTicks begin_time = base::TimeTicks::Now(); > bool is_cld_reliable; > - std::string cld_language = DetermineTextLanguage(contents, > &is_cld_reliable); > - translate::ReportLanguageDetectionTime(begin_time, > base::TimeTicks::Now()); > - > - if (cld_language_p != NULL) > - *cld_language_p = cld_language; > - if (is_cld_reliable_p != NULL) > - *is_cld_reliable_p = is_cld_reliable; > - translate::ToTranslateLanguageSynonym(&cld_language); > - > // Check if html lang attribute is valid. > std::string modified_html_lang; > if (!html_lang.empty()) { > @@ -239,6 +253,16 @@ std::string DeterminePageLanguage(const std::string& > code, > translate::ReportContentLanguage(code, modified_code); > } > > + std::string cld_language = DetermineTextLanguage( > + contents, &is_cld_reliable, modified_code, modified_html_lang); > + translate::ReportLanguageDetectionTime(begin_time, > base::TimeTicks::Now()); > + > + if (cld_language_p != NULL) > + *cld_language_p = cld_language; > + if (is_cld_reliable_p != NULL) > + *is_cld_reliable_p = is_cld_reliable; > + translate::ToTranslateLanguageSynonym(&cld_language); > + > // Adopt |modified_html_lang| if it is valid. Otherwise, adopt > // |modified_code|. > std::string language = modified_html_lang.empty() ? modified_code : > > > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
LGTM, with minor nit about commenting the nullptr. This has the potential to cause some regressions in page translation, though one would hope it will be a net win for translate. I would advise subscribing to the "UI-Browser-Translate" label in Chromium issues for a while to triage any incoming problem reports. CC'ing hajimehoshi@, droger@, and toyoshim@ to make them aware of this change in case we start seeing an influx of reports of different translation choices. I'd recommend sending an explicit email to them all as well, just to ensure everyone is on the same page for this. https://codereview.chromium.org/1263613002/diff/20001/components/translate/co... File components/translate/core/language_detection/language_detection_util.cc (right): https://codereview.chromium.org/1263613002/diff/20001/components/translate/co... components/translate/core/language_detection/language_detection_util.cc:136: language3, percent3, normalized_score3, nullptr, &text_bytes, Please comment the meaning of the nullptr here, we typically would do something like this: lib::SomeFunction(param1, param2, param3, param4, nullptr /* super_special_parameter */, param5, param6); https://codereview.chromium.org/1263613002/diff/20001/components/translate/co... components/translate/core/language_detection/language_detection_util.cc:146: language3, percent3, normalized_score3, nullptr, &text_bytes, And same here, please
Can you have a meta bug to track this series of changes for chrome.i18n.detectLanguage API?
I am curious to see how the existing tests fare on this, so kicking off a dry run without the CQ bit set.
The CQ bit was checked by andrewhayden@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1263613002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1263613002/20001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_compile_dbg_32_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) ios_dbg_simulator_ninja on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_dbg_simulator...) ios_rel_device_ninja on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios_rel_device_ni...) mac_chromium_10.10_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_10.1...) mac_chromium_compile_dbg_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...) mac_chromium_rel_ng on tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
Patchset #3 (id:40001) has been deleted
The CQ bit was checked by mcindy@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1263613002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1263613002/80001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mcindy@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from andrewhayden@chromium.org Link to the patchset: https://codereview.chromium.org/1263613002/#ps80001 (title: "Added inline comments on CLD params")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1263613002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1263613002/80001
Message was sent while issue was closed.
Committed patchset #4 (id:80001)
Message was sent while issue was closed.
Patchset 4 (id:??) landed as https://crrev.com/55614df31bd8645d20bd824a486305e9545eba09 Cr-Commit-Position: refs/heads/master@{#342232} |
