|
|
Created:
5 years, 3 months ago by Peter Kasting Modified:
5 years, 3 months ago Reviewers:
msw CC:
chromium-reviews Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionImprove text elision in TruncateString().
* When doing word-breaking in the middle of the first word in a string with
leading whitespace, we would incorrectly truncate the entire string.
* When doing word-breaking in whitespace after a word, we would incorrectly omit
that word from the truncated string.
This also adds more tests and organizes them a bit more readably.
BUG=none
TEST=Select up to but not including the 'f' in the string "aaaaaaaaa bbbbbbbbb ccccccccc ddddddddd eeeeeeeee f". Right-click the selection and look at the "Search Google for..." string. It should include 'e's.
Committed: https://crrev.com/fc336926514596ca46b1c1e2973f48c1b6766df1
Cr-Commit-Position: refs/heads/master@{#350310}
Patch Set 1 #
Total comments: 22
Patch Set 2 : Review comments #
Total comments: 3
Messages
Total messages: 18 (4 generated)
pkasting@chromium.org changed reviewers: + msw@chromium.org
https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:759: // String fits, return it. unrelated nit: please move the comment above the if statement or add curlies; ditto for the other two early returns below. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); I'd be a lot more comfortable with this code using iterators to set indices and avoiding +/-1 increments that might chop surrogates. It shouldn't be too hard to rewrite the char iterator loop to start at |length| instead of |length-1|, but the successful word break code path might need to return early; WDYT? https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:781: return string.substr(0, length - 1) + kElideString; It's unrelated (okay for follow-up), but I think this error case should still try to use the character iterator codepath below instead of potentially chopping a surrogate pair or similar. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:783: index = bi->preceding(index); nit: consider using |index = bi->preceding(length);| here and just unconditionally initializing index to |length - 1| on line 773 https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:803: // this character, plus the ellipsis. nit: s/plus/and append/ https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:814: return string.substr(0, length - 1) + kElideString; This should use a char iterator to avoid chopping a surrogate pair. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest.cc File ui/gfx/text_elider_unittest.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1073: TEST(TextEliderTest, MAYBE_TruncateString) { Adding a test with surrogate pairs, etc. would be nice. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1081: EXPECT_EQ(L"\x2026", UTF16ToWide(TruncateString(str, 1, WORD_BREAK))); nit: maybe EXPECT_STREQ would make UTF16ToWide unnecessary? https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1093: // Test breaking at the start of a later word. It's probably worthwhile to test index 12 too (but maybe that's mostly covered by testing index 5 below).
https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); On 2015/09/17 18:42:44, msw wrote: > I'd be a lot more comfortable with this code using iterators to set indices and > avoiding +/-1 increments that might chop surrogates. It shouldn't be too hard to > rewrite the char iterator loop to start at |length| instead of |length-1|, but > the successful word break code path might need to return early; WDYT? I tried a couple times yesterday to rewrite the char iterator loop and couldn't do it correctly. Also, depending on whether you're getting into it from the word break loop or not, you want to start at (length) or (length - 1). Just changing that to (length + 1) and (length) doesn't help avoid potential surrogate breaking. I might be able to do a more extensive rewrite of this function that gets all this more right. I wouldn't mind addressing all your non-surrogate-breaking-related comments, checking this in (since the existing code already suffers from these issues, so we wouldn't be making them worse, but we would at least improve test coverage and fix some bugs), and then doing the rewrite pass as a second CL. Does that seem OK?
https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); On 2015/09/17 19:39:03, Peter Kasting wrote: > On 2015/09/17 18:42:44, msw wrote: > > I'd be a lot more comfortable with this code using iterators to set indices > and > > avoiding +/-1 increments that might chop surrogates. It shouldn't be too hard > to > > rewrite the char iterator loop to start at |length| instead of |length-1|, but > > the successful word break code path might need to return early; WDYT? > > I tried a couple times yesterday to rewrite the char iterator loop and couldn't > do it correctly. Also, depending on whether you're getting into it from the > word break loop or not, you want to start at (length) or (length - 1). Just > changing that to (length + 1) and (length) doesn't help avoid potential > surrogate breaking. > > I might be able to do a more extensive rewrite of this function that gets all > this more right. I wouldn't mind addressing all your > non-surrogate-breaking-related comments, checking this in (since the existing > code already suffers from these issues, so we wouldn't be making them worse, but > we would at least improve test coverage and fix some bugs), and then doing the > rewrite pass as a second CL. Does that seem OK? That seems fine, except lines 804 and 814 seem to potentially introduce new opportunities to break surrogates.
https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); On 2015/09/17 20:45:10, msw wrote: > On 2015/09/17 19:39:03, Peter Kasting wrote: > > On 2015/09/17 18:42:44, msw wrote: > > > I'd be a lot more comfortable with this code using iterators to set indices > > and > > > avoiding +/-1 increments that might chop surrogates. It shouldn't be too > hard > > to > > > rewrite the char iterator loop to start at |length| instead of |length-1|, > but > > > the successful word break code path might need to return early; WDYT? > > > > I tried a couple times yesterday to rewrite the char iterator loop and > couldn't > > do it correctly. Also, depending on whether you're getting into it from the > > word break loop or not, you want to start at (length) or (length - 1). Just > > changing that to (length + 1) and (length) doesn't help avoid potential > > surrogate breaking. > > > > I might be able to do a more extensive rewrite of this function that gets all > > this more right. I wouldn't mind addressing all your > > non-surrogate-breaking-related comments, checking this in (since the existing > > code already suffers from these issues, so we wouldn't be making them worse, > but > > we would at least improve test coverage and fix some bugs), and then doing the > > rewrite pass as a second CL. Does that seem OK? > > That seems fine, except lines 804 and 814 seem to potentially introduce new > opportunities to break surrogates. 804 is definitely a regression I need to fix. I meant to comment about that before and forgot, but basically I need to restore the use of char_iterator.next() there instead of adding 1. 814 can break a surrogate, but it's the same break point (length - 1) we already have a problem with a couple other places in this code. I don't really consider that a new bug; I think the right fix there is basically going to be the same as all the other places, which is pretty much, at the top of the function we need to use a character iterator to find the right character break before length, and then use that everywhere. So I'd rather fix that at the same time as I fix the similar instances.
https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); On 2015/09/17 20:58:05, Peter Kasting wrote: > On 2015/09/17 20:45:10, msw wrote: > > On 2015/09/17 19:39:03, Peter Kasting wrote: > > > On 2015/09/17 18:42:44, msw wrote: > > > > I'd be a lot more comfortable with this code using iterators to set > indices > > > and > > > > avoiding +/-1 increments that might chop surrogates. It shouldn't be too > > hard > > > to > > > > rewrite the char iterator loop to start at |length| instead of |length-1|, > > but > > > > the successful word break code path might need to return early; WDYT? > > > > > > I tried a couple times yesterday to rewrite the char iterator loop and > > couldn't > > > do it correctly. Also, depending on whether you're getting into it from the > > > word break loop or not, you want to start at (length) or (length - 1). Just > > > changing that to (length + 1) and (length) doesn't help avoid potential > > > surrogate breaking. > > > > > > I might be able to do a more extensive rewrite of this function that gets > all > > > this more right. I wouldn't mind addressing all your > > > non-surrogate-breaking-related comments, checking this in (since the > existing > > > code already suffers from these issues, so we wouldn't be making them worse, > > but > > > we would at least improve test coverage and fix some bugs), and then doing > the > > > rewrite pass as a second CL. Does that seem OK? > > > > That seems fine, except lines 804 and 814 seem to potentially introduce new > > opportunities to break surrogates. > > 804 is definitely a regression I need to fix. I meant to comment about that > before and forgot, but basically I need to restore the use of > char_iterator.next() there instead of adding 1. > > 814 can break a surrogate, but it's the same break point (length - 1) we already > have a problem with a couple other places in this code. I don't really consider > that a new bug; I think the right fix there is basically going to be the same as > all the other places, which is pretty much, at the top of the function we need > to use a character iterator to find the right character break before length, and > then use that everywhere. So I'd rather fix that at the same time as I fix the > similar instances. OK, sgtm
PTAL https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:759: // String fits, return it. On 2015/09/17 18:42:44, msw wrote: > unrelated nit: please move the comment above the if statement or add curlies; > ditto for the other two early returns below. Fixed by exchanging these for some EOL comments (which I rewrote slightly). https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:773: int32_t index = static_cast<int32_t>(length) - (word_break ? 0 : 1); On 2015/09/17 21:06:26, msw wrote: > On 2015/09/17 20:58:05, Peter Kasting wrote: > > On 2015/09/17 20:45:10, msw wrote: > > > On 2015/09/17 19:39:03, Peter Kasting wrote: > > > > On 2015/09/17 18:42:44, msw wrote: > > > > > I'd be a lot more comfortable with this code using iterators to set > > indices > > > > and > > > > > avoiding +/-1 increments that might chop surrogates. It shouldn't be too > > > hard > > > > to > > > > > rewrite the char iterator loop to start at |length| instead of > |length-1|, > > > but > > > > > the successful word break code path might need to return early; WDYT? > > > > > > > > I tried a couple times yesterday to rewrite the char iterator loop and > > > couldn't > > > > do it correctly. Also, depending on whether you're getting into it from > the > > > > word break loop or not, you want to start at (length) or (length - 1). > Just > > > > changing that to (length + 1) and (length) doesn't help avoid potential > > > > surrogate breaking. > > > > > > > > I might be able to do a more extensive rewrite of this function that gets > > all > > > > this more right. I wouldn't mind addressing all your > > > > non-surrogate-breaking-related comments, checking this in (since the > > existing > > > > code already suffers from these issues, so we wouldn't be making them > worse, > > > but > > > > we would at least improve test coverage and fix some bugs), and then doing > > the > > > > rewrite pass as a second CL. Does that seem OK? > > > > > > That seems fine, except lines 804 and 814 seem to potentially introduce new > > > opportunities to break surrogates. > > > > 804 is definitely a regression I need to fix. I meant to comment about that > > before and forgot, but basically I need to restore the use of > > char_iterator.next() there instead of adding 1. > > > > 814 can break a surrogate, but it's the same break point (length - 1) we > already > > have a problem with a couple other places in this code. I don't really > consider > > that a new bug; I think the right fix there is basically going to be the same > as > > all the other places, which is pretty much, at the top of the function we need > > to use a character iterator to find the right character break before length, > and > > then use that everywhere. So I'd rather fix that at the same time as I fix > the > > similar instances. > > OK, sgtm Fixed line 804 by restoring the next() call. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:781: return string.substr(0, length - 1) + kElideString; On 2015/09/17 18:42:44, msw wrote: > It's unrelated (okay for follow-up), but I think this error case should still > try to use the character iterator codepath below instead of potentially chopping > a surrogate pair or similar. Yep, will address in the followup. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:783: index = bi->preceding(index); On 2015/09/17 18:42:44, msw wrote: > nit: consider using |index = bi->preceding(length);| here and just > unconditionally initializing index to |length - 1| on line 773 Done. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider.cc#newco... ui/gfx/text_elider.cc:803: // this character, plus the ellipsis. On 2015/09/17 18:42:44, msw wrote: > nit: s/plus/and append/ Done. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest.cc File ui/gfx/text_elider_unittest.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1073: TEST(TextEliderTest, MAYBE_TruncateString) { On 2015/09/17 18:42:44, msw wrote: > Adding a test with surrogate pairs, etc. would be nice. Since the code can get those wrong today, let me do that as part of my next patch. I agree we should be testing them. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1081: EXPECT_EQ(L"\x2026", UTF16ToWide(TruncateString(str, 1, WORD_BREAK))); On 2015/09/17 18:42:44, msw wrote: > nit: maybe EXPECT_STREQ would make UTF16ToWide unnecessary? Nah, that just means you need to supply two pointers, and they have to be the same type. The issue is that on Mac there's no way to do a raw string16 pointer literal. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1093: // Test breaking at the start of a later word. On 2015/09/17 18:42:44, msw wrote: > It's probably worthwhile to test index 12 too (but maybe that's mostly covered > by testing index 5 below). Done. Expanded the tests below by a case as well.
lgtm with an optional nit/idea for followup test changes. https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest.cc File ui/gfx/text_elider_unittest.cc (right): https://codereview.chromium.org/1340423006/diff/1/ui/gfx/text_elider_unittest... ui/gfx/text_elider_unittest.cc:1073: TEST(TextEliderTest, MAYBE_TruncateString) { On 2015/09/19 02:43:25, Peter Kasting wrote: > On 2015/09/17 18:42:44, msw wrote: > > Adding a test with surrogate pairs, etc. would be nice. > > Since the code can get those wrong today, let me do that as part of my next > patch. I agree we should be testing them. Acknowledged. https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider.cc File ui/gfx/text_elider.cc (right): https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider.cc#n... ui/gfx/text_elider.cc:812: return string.substr(0, length - 1) + kElideString; This is the one new place that might break multi-char graphemes, but it would do so in the same manner as other index-1 defects above; it's okay for now. https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider_unit... File ui/gfx/text_elider_unittest.cc (right): https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider_unit... ui/gfx/text_elider_unittest.cc:1101: UTF16ToWide(TruncateString(str, 12, CHARACTER_BREAK))); optional nit: test index 13 too for completeness here? It'd be nice if these tests just looped over arrays of expected char/word breaks for each length...
https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider_unit... File ui/gfx/text_elider_unittest.cc (right): https://codereview.chromium.org/1340423006/diff/20001/ui/gfx/text_elider_unit... ui/gfx/text_elider_unittest.cc:1101: UTF16ToWide(TruncateString(str, 12, CHARACTER_BREAK))); On 2015/09/21 18:28:10, msw wrote: > optional nit: test index 13 too for completeness here? The (str2, 5) case below checks this particular type of breaking. > It'd be nice if these > tests just looped over arrays of expected char/word breaks for each length... Yeah, I think that's a good idea. I'll look at that when I do the stage 2 patch here.
The CQ bit was checked by pkasting@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1340423006/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1340423006/20001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: win8_chromium_ng on tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win8_chromium_ng/...)
The CQ bit was checked by pkasting@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1340423006/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1340423006/20001
Message was sent while issue was closed.
Committed patchset #2 (id:20001)
Message was sent while issue was closed.
Patchset 2 (id:??) landed as https://crrev.com/fc336926514596ca46b1c1e2973f48c1b6766df1 Cr-Commit-Position: refs/heads/master@{#350310} |