|
|
Created:
4 years, 3 months ago by Anton Bakalov Modified:
4 years, 3 months ago CC:
chromium-reviews, djweiss, riesa, Jackie Quinn Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionUpdating the CLD3 GitHub commit hash in DEPS
This change results in chunks from the input getting removed if they
contain repetitive strings or mostly spaces.
BUG=
Committed: https://crrev.com/e73e1a1f69d9ac7dc8216f930544aa5650d74ef2
Cr-Commit-Position: refs/heads/master@{#414767}
Patch Set 1 #Patch Set 2 : Adding more text to the English test webpage #Patch Set 3 : (1) reverting the test file, (2) specifying the commit hash with the new space percentage threshold #
Total comments: 1
Messages
Total messages: 29 (18 generated)
The CQ bit was checked by abakalov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
abakalov@chromium.org changed reviewers: + andrewhayden@chromium.org
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
Looking at upstream commit: https://github.com/google/cld3/commit/7a341405cdafbe82478f99702777d98cfa32151f This LGTM. It's unclear why some of the tests on some platforms are seeing unexpected language results. It might be worth re-running them to see if it was just a lag in the chromium git mirror sync with the upstream github project. Otherwise, there may be more to do upstream if these tests are failing legitimately.
The CQ bit was checked by abakalov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
abakalov@chromium.org changed reviewers: + phajdan.jr@chromium.org
+ Paweł who is an OWNER of the tests This CL removes chunks from the input if they contain repetitive strings or mostly spaces. This processing is helpful for the language detector (CLD3) when dealing with gibberish text and input which is source code (e.g., Chromium review pages). In one of the tests, removing a substring resulted in going below the the threshold of 100 bytes needed for CLD3 (and its predecessor CLD2). As a result, the prediction was "unknown". I replaced the existing text with a couple of sentences from chromium.org. What do you guys think?
I imagine that we only lost a couple bytes from the old sentence, though, since the upstream change drops isolated characters and repetitive sequences... maybe you could just add a little more text (like "English is a language you can speak, with your mouth!" in the vein of the existing humor) instead of replacing it with lots more text. Still LGTM though.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
Thanks for the review, Andrew! I thought that the title was included in the text that gets passed to the language detector. That turned out not to be the case, so my original interpretation is incorrect. What is actually happening is that the percentage of the spaces in the substring "Awesome don t you think It has to be more than bytes long" is 25%, which is the threshold used by the new code. So, this piece of text got filtered out. I bumped up the threshold to 30%, and the test passes. I plan to keep this new threshold and revert the test file. Let me know if you think otherwise.
The CQ bit was checked by abakalov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by abakalov@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from andrewhayden@chromium.org Link to the patchset: https://codereview.chromium.org/2283683002/#ps40001 (title: "(1) reverting the test file, (2) specifying the commit hash with the new space percentage threshold")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Committed patchset #3 (id:40001)
Message was sent while issue was closed.
Description was changed from ========== Updating the CLD3 GitHub commit hash in DEPS This change results in chunks from the input getting removed if they contain repetitive strings or mostly spaces. BUG= ========== to ========== Updating the CLD3 GitHub commit hash in DEPS This change results in chunks from the input getting removed if they contain repetitive strings or mostly spaces. BUG= Committed: https://crrev.com/e73e1a1f69d9ac7dc8216f930544aa5650d74ef2 Cr-Commit-Position: refs/heads/master@{#414767} ==========
Message was sent while issue was closed.
Patchset 3 (id:??) landed as https://crrev.com/e73e1a1f69d9ac7dc8216f930544aa5650d74ef2 Cr-Commit-Position: refs/heads/master@{#414767}
Message was sent while issue was closed.
eugenebut@chromium.org changed reviewers: + eugenebut@chromium.org
Message was sent while issue was closed.
https://codereview.chromium.org/2283683002/diff/40001/DEPS File DEPS (right): https://codereview.chromium.org/2283683002/diff/40001/DEPS#newcode264 DEPS:264: Var('chromium_git') + '/external/github.com/google/cld_3.git' + '@' + 'f01672272dacc4cb3409f458ed61f7d4eb0f47de', This change breaks Chrome for iOS language detection test. Is it possible to revert the CL until we understand what went wrong?
Message was sent while issue was closed.
Just to update the thread -- Jackie, Eugene and I exchanged a few emails. This change to the language prediction model makes it more robust because unreliable pieces of the input are discarded but if the text ends up being too short, then it outputs "und" (or undefined/unknown). So, adding more reliable text to the test should fix this issue. Jackie suggested to have the test disabled and to check if there might be encoding issues with the test server. If that's not the case and adding more reliable text doesn't fix the problem, please let me know!
Message was sent while issue was closed.
I extended the sample text that we're using for detection, and the tests are now passing again. Thanks! Jackie On Fri, Aug 26, 2016 at 7:40 PM <abakalov@chromium.org> wrote: > Just to update the thread -- Jackie, Eugene and I exchanged a few emails. > > This change to the language prediction model makes it more robust because > unreliable pieces of the input are discarded but if the text ends up being > too > short, then it outputs "und" (or undefined/unknown). So, adding more > reliable > text to the test should fix this issue. > > Jackie suggested to have the test disabled and to check if there might be > encoding issues with the test server. If that's not the case and adding > more > reliable text doesn't fix the problem, please let me know! > > https://codereview.chromium.org/2283683002/ > -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org. |