|
|
Created:
5 years ago by kojii Modified:
5 years ago CC:
chromium-reviews, pdr+graphicswatchlist_chromium.org, drott+blinkwatch_chromium.org, Rik, blink-reviews-platform-graphics_chromium.org, dshwang, jbroman, Justin Novosad, danakj, krit, f(malita), blink-reviews, vmpstr+blinkwatch_chromium.org, Stephen Chennney, rwlbuis Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionChange CachingWordShapeIterator to delimit CJK characters
This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.
Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.
There are, however, other cases where this characteristic hits the
performance negatively.
The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.
There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.
BUG=565902, 570229
Committed: https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f
Cr-Commit-Position: refs/heads/master@{#366421}
Patch Set 1 #Patch Set 2 : Fix allowTabs case #Patch Set 3 : Avoid split before COMMON/INHERITED #Patch Set 4 : Some fixes, cleanup, and more tests #Patch Set 5 : Minor fixes, more tests, and cleanup #Patch Set 6 : TestExpectations #
Total comments: 2
Patch Set 7 : Refactor to clarify what nextWord does and less functions #
Total comments: 4
Patch Set 8 : Add test, rebase, TestExpectations #Patch Set 9 : TestExpectations #
Depends on Patchset: Messages
Total messages: 29 (11 generated)
Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text by break opportunities, this characteristic is not critical for the layout purpose. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down when it has many <option> elements with CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6~7 times. This patch improves it to 10~20% slower on debug builds. BUG=565902 patch from issue 1525963002 at patchset 60001 (http://crrev.com/1525963002#ps60001) ========== to ========== Change CachingWordShapeIterator to handle one CJK character as a word This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text by break opportunities, this characteristic is not critical for the layout purpose. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down when it has many <option> elements with CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6~7 times. This patch improves it to 10~20% slower on debug builds. BUG=565902 patch from issue 1525963002 at patchset 60001 (http://crrev.com/1525963002#ps60001) ==========
Patchset #1 (id:1) has been deleted
kojii@chromium.org changed reviewers: + drott@chromium.org, eae@chromium.org
Still WIP, but script/font/fallback seemed to change and you might be interested in. From the PS3 test results: 1. wrap-CJK-001.html on Win is particularly interesting. Kana characters after "13" was using wrong font (script was determined as en?), but this CL fixes it. 2. Some Emoji on Win/Mac fail to render as Emoji. Haven't figured out why. 3. Some Tofu on Win/Linux turned to the correct glyphs. Haven't figured out why, maybe related with 1. A good change, but why we pick Tofu in expected image and why only some are fixed needs to be figured out. 4. Slight width changes on Mac. I think this is the right thing, we used to use different segmentation for measure and render if CJK, but with this patch, we're closer. So in general, this CL at this point improves fonts/scripts/fallback for CJK, but degrades Emoji a bit.
> 2. Some Emoji on Win/Mac fail to render as Emoji. Haven't figured out why. Emoji failures are due to http://crbug.com/570647 so I will not worry in this CL.
Patchset #4 (id:80001) has been deleted
Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text by break opportunities, this characteristic is not critical for the layout purpose. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down when it has many <option> elements with CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6~7 times. This patch improves it to 10~20% slower on debug builds. BUG=565902 patch from issue 1525963002 at patchset 60001 (http://crrev.com/1525963002#ps60001) ========== to ========== Change CachingWordShapeIterator to handle one CJK character as a word This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance, and this patch improves such cases. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ==========
Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance, and this patch improves such cases. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ========== to ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance, and this patch improves such cases. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ==========
Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance, and this patch improves such cases. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ========== to ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance negatively. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ==========
eae@, PTAL.
On 2015/12/20 16:14:32, kojii wrote: > eae@, PTAL. Forgot to note that this CL has a depends on CL, appreciate to look at both.
Thanks for trying to improve the caching mechanism for CJK, Koji. It's a bit hard to do this via Rietveld, working on this together in person might be easier. But I am trying to put some of my thoughts and improvement suggestions in the review comment below. Feel free to ping me on Hangouts to discuss further. https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h (right): https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121: // Delimit every CJK character because these scripts do not delimit AFAICS we have now three modes in this file: Separate by spaces, separate for tabulation, and separate by CJK codepoints + combining marks. The segmentation logic is getting really hard to read and some member variables are starting to be wrongly named and misleading. Here we're in "nextWordRange()" but what is returned is not a word range but a CJK "extended" character. It's really hard to follow from bool next(RefPtr<ShapeResult>* wordResult) to how the text actually gets split and I'd hope we can improve that. For example in line 62 we make a decision on whether we can use caching and use "m_shapeByWord". Can the logic for whether this is a CJK run be put there? Could we perhaps make this into a decider which breaking logic to use? m_shapeByMode out of { Spaces, Tabulation, CJK characters } or so? It's hard to keep track of what is going on if the m_shapeByWord variable does not actually tell what breaking strategy is used. Can we somehow reflect the breaking strategy be more clearly in the code, then use the breaking strategy right from the next() method, instead of branching from nextWord to nextWordRangeForCJK()?
https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h (right): https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121: // Delimit every CJK character because these scripts do not delimit On 2015/12/21 10:36:01, drott wrote: > AFAICS we have now three modes in this file: Separate by spaces, separate for > tabulation, and separate by CJK codepoints + combining marks. I think that's not the correct way to see the options; I'd say: 1. We have tabulation mode on/off 2. We have word caching on/off 3. We have script-based word segmentation, currently support two flavors by code points. 1 and 2 are independent flags that can be combined any way. > The segmentation logic is getting really hard to read and some member variables > are starting to be wrongly named and misleading. Here we're in "nextWordRange()" > but what is returned is not a word range but a CJK "extended" character. It's > really hard to follow from bool next(RefPtr<ShapeResult>* wordResult) to how the > text actually gets split and I'd hope we can improve that. > > For example in line 62 we make a decision on whether we can use caching and use > "m_shapeByWord". Can the logic for whether this is a CJK run be put there? Could > we perhaps make this into a decider which breaking logic to use? m_shapeByMode > out of { Spaces, Tabulation, CJK characters } or so? As above, that's not correct, because there are Tabulation + Spaces and Tabulation + CJK. I agree that the level of calls are deep. I can merge nextWordRangeForCJK() to nextWordRnage() if you prefer; it used to be so until the patch before, but separated thinking non-CJK might not want to see detailed CJK logic. But doing so can decrease one level. > It's hard to keep track of > what is going on if the m_shapeByWord variable does not actually tell what > breaking strategy is used. I think that's conceptual misunderstanding. m_shapeByWord indicates whether word caching is possible or not. Currently we don't set it properly for CJK, assuming its low priority, but it may be false for CJK glyphs too in future. When word caching is enabled, we need script-based segmentation. I'd like to make it as minimum as possible, but as you know, word segmentation with correct i18n support is very hard. For instance, this logic does not support South Asian scripts yet. It's not raised as an important issue so I'm currently ignoring, but just want your understanding that the current word segmentation is very simple. > Can we somehow reflect the breaking strategy be more clearly in the code, then > use the breaking strategy right from the next() method, instead of branching > from nextWord to nextWordRangeForCJK()? As I said, I can merge the two functions. I'll try it in the next patch. But I wish your understanding, "segmenting to word" **is** a very complex problem, we're trying to implement its minimal subset here, but the minimal is probably more complicated than we might want it to be when i18n is involved.
On 2015/12/21 11:59:56, kojii wrote: > https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... > File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h > (right): > > https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou... > third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121: > // Delimit every CJK character because these scripts do not delimit > On 2015/12/21 10:36:01, drott wrote: > > AFAICS we have now three modes in this file: Separate by spaces, separate for > > tabulation, and separate by CJK codepoints + combining marks. > > I think that's not the correct way to see the options; I'd say: > 1. We have tabulation mode on/off > 2. We have word caching on/off > 3. We have script-based word segmentation, currently support two flavors by code > points. So, by 3) you mean: A splitting logic that is different per script? Okay - so you would call the split-off CJK "units" "words" as well? That's what irritates me a bit - since as far as I understand they are not words really, but rather character sequences, potentially a part of a word? > 1 and 2 are independent flags that can be combined any way. > > > The segmentation logic is getting really hard to read and some member > variables > > are starting to be wrongly named and misleading. Here we're in > "nextWordRange()" > > but what is returned is not a word range but a CJK "extended" character. It's > > really hard to follow from bool next(RefPtr<ShapeResult>* wordResult) to how > the > > text actually gets split and I'd hope we can improve that. > > > > For example in line 62 we make a decision on whether we can use caching and > use > > "m_shapeByWord". Can the logic for whether this is a CJK run be put there? > Could > > we perhaps make this into a decider which breaking logic to use? m_shapeByMode > > > out of { Spaces, Tabulation, CJK characters } or so? > > As above, that's not correct, because there are Tabulation + Spaces and > Tabulation + CJK. Okay, thanks for the clarification. > I agree that the level of calls are deep. I can merge nextWordRangeForCJK() to > nextWordRnage() if you prefer; it used to be so until the patch before, but > separated thinking non-CJK might not want to see detailed CJK logic. But doing > so can decrease one level. > > > It's hard to keep track of > > what is going on if the m_shapeByWord variable does not actually tell what > > breaking strategy is used. > > I think that's conceptual misunderstanding. m_shapeByWord indicates whether word > caching is possible or not. Currently we don't set it properly for CJK, assuming > its low priority, but it may be false for CJK glyphs too in future. The meaning of the flag is clear. Perhaps I didn't express myself well. IMO, the naming could be improved, since in CJK the individual units are not really "words" IIUC. But I see what you mean regarding the combinations of modes. > When word caching is enabled, we need script-based segmentation. I'd like to > make it as minimum as possible, but as you know, word segmentation with correct > i18n support is very hard. For instance, this logic does not support South Asian > scripts yet. It's not raised as an important issue so I'm currently ignoring, > but just want your understanding that the current word segmentation is very > simple. There is a well-tested ScriptRunIterator which we use for the shaping step, is used in RunSegmenter.cpp to segment for shaping. - In fact, it would be great if we have only script, small caps, and orientation resolved entries in the word cache, which is the output of the RunSegmenter. The sub-runs returned by this step are suitable for shaping. Would it make sense to move the caching below this segmentation step? > > Can we somehow reflect the breaking strategy be more clearly in the code, then > > use the breaking strategy right from the next() method, instead of branching > > from nextWord to nextWordRangeForCJK()? > > As I said, I can merge the two functions. I'll try it in the next patch. > > But I wish your understanding, "segmenting to word" **is** a very complex > problem, we're trying to implement its minimal subset here, but the minimal is > probably more complicated than we might want it to be when i18n is involved. Word separation is complex, yes - I am not sure what you mean by "but the minimal is probably more complicated than we might want it to be when i18n is involved."
LGTM, I find this clearer. (With a nit on adding a common-only test case). Sorry if there was some back and forth regarding how to refactor. I'd still like to discuss whether we should generally have the full script segmentation step before the word cache splitting. What do you think? https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h (right): https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:150: || (!m_textRun.is8Bit() Am I understanding correctly that this covers the case when the first character of a word was non-CJK and we encounter the first CJK character? https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShaperTest.cpp (right): https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShaperTest.cpp:299: TEST_F(CachingWordShaperTest, SegmentCJKCommonAndNonCJK) Could you add one test with only COMMON, no actual script values? IIUC we would expect to use the full length of the run then.
Refactored in PS7 given your comment, along with the other CL. This time, the focus is not to modular too much, but: * Easier to read what nextWord() does. * Avoid too deep functions when sharing the code does not give much values. * Less diff for easier review. Hope this aligns to your comments. More replies follow inline: On 2015/12/21 12:26:03, drott wrote: > > > > I think that's not the correct way to see the options; I'd say: > > 1. We have tabulation mode on/off > > 2. We have word caching on/off > > 3. We have script-based word segmentation, currently support two flavors by > code > > points. > > So, by 3) you mean: A splitting logic that is different per script? Okay - so > you would call the split-off CJK "units" "words" as well? That's what irritates > me a bit - since as far as I understand they are not words really, but rather > character sequences, potentially a part of a word? The "word" is ambiguous by definition. For example, when an input "Hello." was given, this is a word for CachingWordShaper but is two words for the selection. When performance is desired, defining a single CJK character as a word is widely used technique. There are other variations to quickly determine a "word" for CJK, but for this purpose, I think this is the best one. True that it's not a "word" in morphological analysis, but we're defining what a "word" is for CachingWordShaper, right? I'm just changing the definition to suit better for our goal. > The meaning of the flag is clear. Perhaps I didn't express myself well. IMO, the > naming could be improved, since in CJK the individual units are not really > "words" IIUC. But I see what you mean regarding the combinations of modes. Call it a "unit" instead, since it's really not a "word" for Latin either? I think it's more accurate, but is harder to understand for whoever does not need that much details. i18n-friendly "word" is defined accordingly to its purpose, and could have multiple definitions within a single product. We don't rename "word-selection" to "unit-selection". Unicode Text Segmentation[1] has "word" but it's never be a morphological word for Asians. In these cases, "word" is higher-level concept that mimics what a "word" is for Latin. [1] http://unicode.org/reports/tr29/ > There is a well-tested ScriptRunIterator which we use for the shaping step, is > used in RunSegmenter.cpp to segment for shaping. - In fact, it would be great if > we have only script, small caps, and orientation resolved entries in the word > cache, which is the output of the RunSegmenter. The sub-runs returned by this > step are suitable for shaping. Would it make sense to move the caching below > this segmentation step? Ah, that's a good idea. I haven't read RunSegmenter well enough to determine if it's a good idea or not yet, but that's a good candidate to improve this further. Another idea I was thinking was to run the line breaker. Since the line breaker is the heaviest user of width(), if all other inputs are aligned to the line breaker, the cache should perform the best. One more thing. When I was looking at test failures of early patches in this CL, I noticed a design issue that, since the line breaker and paint gives different runs to CachingWordShaper (line breaker splits at break opportunities,) we determine different scripts/fallback fonts for the line breaker and paint. This is probably the root cause of slight changes in pixels. I haven't come up with how to solve this yet, hope to discuss more with you and Emil, but we need to land this fix much earlier. As you might be aware of, multiple people are asking me to revert the complex path due to this issue, and this is currently beta-blocker. People complains that because of this, other features are not getting enough dogfood. Can we try to ease their pains first? > > > Can we somehow reflect the breaking strategy be more clearly in the code, > then > > > use the breaking strategy right from the next() method, instead of branching > > > from nextWord to nextWordRangeForCJK()? > > > > As I said, I can merge the two functions. I'll try it in the next patch. > > > > But I wish your understanding, "segmenting to word" **is** a very complex > > problem, we're trying to implement its minimal subset here, but the minimal is > > probably more complicated than we might want it to be when i18n is involved. > > Word separation is complex, yes - I am not sure what you mean by "but the > minimal is > probably more complicated than we might want it to be when i18n is involved." Not much, sorry, please disregard that part. I haven't come up a good plan for Thai yet, wondering that might be an issue sometime in future and this code may be a bit more complicated when it happens. But nothing concrete yet.
On 2015/12/21 14:57:42, drott wrote: > LGTM, I find this clearer. (With a nit on adding a common-only test case). Sorry > if there was some back and forth regarding how to refactor. Not at all, I think this is an improvement. Could you also look at the other one? https://codereview.chromium.org/1530833002/ or should I ask Emil when he wakes up for that one? > I'd still like to discuss whether we should generally have the full script > segmentation step before the word cache splitting. What do you think? Completely agreed, please refer to my comments I wrote while you were writing ;) https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h (right): https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:150: || (!m_textRun.is8Bit() On 2015/12/21 14:57:41, drott wrote: > Am I understanding correctly that this covers the case when the first character > of a word was non-CJK and we encounter the first CJK character? Yes, that is correct. SegmentCJKByCharacter() has a test for this case. https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShaperTest.cpp (right): https://codereview.chromium.org/1525993005/diff/160001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/fonts/shaping/CachingWordShaperTest.cpp:299: TEST_F(CachingWordShaperTest, SegmentCJKCommonAndNonCJK) On 2015/12/21 14:57:41, drott wrote: > Could you add one test with only COMMON, no actual script values? IIUC we would > expect to use the full length of the run then. Done.
On 2015/12/21 15:02:59, kojii wrote: > Refactored in PS7 given your comment, along with the other CL. This time, the > focus is not to modular too much, but: > * Easier to read what nextWord() does. > * Avoid too deep functions when sharing the code does not give much values. > * Less diff for easier review. > Hope this aligns to your comments. Yes, thanks. What do you mean by "along with the other CL" - is this still dependent on the other CL or is it all in one CL now? > i18n-friendly "word" is defined accordingly to its purpose, and could have > multiple definitions within a single product. We don't rename "word-selection" > to "unit-selection". Unicode Text Segmentation[1] has "word" but it's never be a > morphological word for Asians. In these cases, "word" is higher-level concept > that mimics what a "word" is for Latin. Thanks for the clarification - yes, I am okay with calling it word in this context, for lack of a better word :-) for it. > > There is a well-tested ScriptRunIterator which we use for the shaping step, is > > used in RunSegmenter.cpp to segment for shaping. - In fact, it would be great > if > > we have only script, small caps, and orientation resolved entries in the word > > cache, which is the output of the RunSegmenter. The sub-runs returned by this > > step are suitable for shaping. Would it make sense to move the caching below > > this segmentation step? > > Ah, that's a good idea. I haven't read RunSegmenter well enough to determine if > it's a good idea or not yet, but that's a good candidate to improve this > further. > > Another idea I was thinking was to run the line breaker. Since the line breaker > is the heaviest user of width(), if all other inputs are aligned to the line > breaker, the cache should perform the best. I don't think I understand this suggestion. Could you explain a bit more? We're having the WordCache to make the line breaker's job easier. If you mean something like a "line cache", I think we've seen that this is rather ineffective in many cases (does not work well for re-flows, etc.) and it's similar to the "run cache" we had before, which did not lead to the expected performance improvements. > One more thing. When I was looking at test failures of early patches in this CL, > I noticed a design issue that, since the line breaker and paint gives different > runs to CachingWordShaper (line breaker splits at break opportunities,) we > determine different scripts/fallback fonts for the line breaker and paint. This > is probably the root cause of slight changes in pixels. That sounds nasty, would you have an example for such a case? > I haven't come up with how to solve this yet, hope to discuss more with you and > Emil, One of my thoughts is that it would help to have segmentation info higher up, closer to TextRun and keep it there for all further text processing > ... but we need to land this fix much earlier. As you might be aware of, > multiple people are asking me to revert the complex path due to this issue, and > this is currently beta-blocker. People complains that because of this, other > features are not getting enough dogfood. > Can we try to ease their pains first? Yes, I think we can land this change. I'd like to hear eae@'s opinion as well, but I think we can land this now and revert if he has concerns with this change.
On 2015/12/21 15:35:11, drott wrote: > On 2015/12/21 15:02:59, kojii wrote: > > Refactored in PS7 given your comment, along with the other CL. This time, the > > focus is not to modular too much, but: > > * Easier to read what nextWord() does. > > * Avoid too deep functions when sharing the code does not give much values. > > * Less diff for easier review. > > Hope this aligns to your comments. > > Yes, thanks. What do you mean by "along with the other CL" - is this still > dependent on the other CL or is it all in one CL now? Thanks, I just found you reviewed the other one. > > > There is a well-tested ScriptRunIterator which we use for the shaping step, > is > > > used in RunSegmenter.cpp to segment for shaping. - In fact, it would be > great > > if > > > we have only script, small caps, and orientation resolved entries in the > word > > > cache, which is the output of the RunSegmenter. The sub-runs returned by > this > > > step are suitable for shaping. Would it make sense to move the caching below > > > this segmentation step? > > > > Ah, that's a good idea. I haven't read RunSegmenter well enough to determine > if > > it's a good idea or not yet, but that's a good candidate to improve this > > further. > > > > Another idea I was thinking was to run the line breaker. Since the line > breaker > > is the heaviest user of width(), if all other inputs are aligned to the line > > breaker, the cache should perform the best. > > I don't think I understand this suggestion. Could you explain a bit more? We're > having the WordCache to make the line breaker's job easier. If you mean > something like a "line cache", I think we've seen that this is rather > ineffective in many cases (does not work well for re-flows, etc.) and it's > similar to the "run cache" we had before, which did not lead to the expected > performance improvements. No, sorry, I didn't mean that. I was actually the same understanding -- when the line breaker performs well, other usage of CachingWordShaper is low enough. This issue proved that it's not the case. So one way to solve that is that, if all other uses of CachingWordShaper pass the similar input as the line breaker, we can optimize us for the line breaker and others should simply follow. i.e., somehow split out the line breaker for a string, call it to break the input string, and then give the result to CachingWordShaper. This can solve Thai and some Indic scripts, but I'm not confident that's the best way or not yet. > > One more thing. When I was looking at test failures of early patches in this > CL, > > I noticed a design issue that, since the line breaker and paint gives > different > > runs to CachingWordShaper (line breaker splits at break opportunities,) we > > determine different scripts/fallback fonts for the line breaker and paint. > This > > is probably the root cause of slight changes in pixels. > > That sounds nasty, would you have an example for such a case? Maybe these are simply bugs, but this CL is changing some paint results. One case is that "13inch" where "inch" is Katakana, and no space between them. Line breaker splits "13" and "inch", while paint gives the whole string. In this case, the "inch" picks different font from other Katakana. https://storage.googleapis.com/chromium-layout-test-archives/win_chromium_rel... Another case is when the first CJK character of a line does not have a glyph, all following CJK glyphs are also Tofu. This CL improves that. https://storage.googleapis.com/chromium-layout-test-archives/linux_chromium_r... Logically speaking, this could happen to Latin as well. "!" and "<" can break between. See our line-break matrix here: http://kojiishi.github.io/line-break-matrix/ It could be even worse when "word-break: break-all" is used. I think we can summary to two issues, though not sure how much important they are atm: 1. When line-breaker or any other callers split more text than we need to determine scripts/fonts, we fail. 2. Since line-breaker and paints split text differently, we may calculate differently for them. > > I haven't come up with how to solve this yet, hope to discuss more with you > and > > Emil, > > One of my thoughts is that it would help to have segmentation info higher up, > closer to TextRun and keep it there for all further text processing Agreed. I wish it be even higher than TextRun, but not sure where it should be yet. > > ... but we need to land this fix much earlier. As you might be aware of, > > multiple people are asking me to revert the complex path due to this issue, > and > > this is currently beta-blocker. People complains that because of this, other > > features are not getting enough dogfood. > > > > Can we try to ease their pains first? > > Yes, I think we can land this change. I'd like to hear eae@'s opinion as well, > but I think we can land this now and revert if he has concerns with this change. Thanks a lot!
> Maybe these are simply bugs, but this CL is changing some paint results. > > One case is that "13inch" where "inch" is Katakana, and no space between them. > Line breaker splits "13" and "inch", while paint gives the whole string. In this > case, the "inch" picks different font from other Katakana. > https://storage.googleapis.com/chromium-layout-test-archives/win_chromium_rel... > > Another case is when the first CJK character of a line does not have a glyph, > all following CJK glyphs are also Tofu. This CL improves that. > https://storage.googleapis.com/chromium-layout-test-archives/linux_chromium_r... > > Logically speaking, this could happen to Latin as well. "!" and "<" can break > between. See our line-break matrix here: > http://kojiishi.github.io/line-break-matrix/ > > It could be even worse when "word-break: break-all" is used. > > I think we can summary to two issues, though not sure how much important they > are atm: > 1. When line-breaker or any other callers split more text than we need to > determine scripts/fonts, we fail. > 2. Since line-breaker and paints split text differently, we may calculate > differently for them. Could you file a bug for this? I think this is an important case to look at.
The CQ bit was checked by kojii@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from drott@chromium.org Link to the patchset: https://codereview.chromium.org/1525993005/#ps200001 (title: "TestExpectations")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1525993005/200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1525993005/200001
Message was sent while issue was closed.
Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance negatively. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ========== to ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance negatively. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ==========
Message was sent while issue was closed.
Committed patchset #9 (id:200001)
Message was sent while issue was closed.
Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance negatively. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 ========== to ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator to handle one CJK character as a word for the caching purpose. Since these scripts do not delimit words by spaces, text in these scripts do not perform well with the CachingWordShaper. However, since these scripts have break opportunities between almost every character, and since the line breaker, the most performance critical user of CachingWordShaper, splits text at break opportunities, this characteristic is not critical for the layout purpose. There are, however, other cases where this characteristic hits the performance negatively. The <select> element measures the width of all text in its <option> elements to determine its width, and since each <option> usually has unique text, this characteristic of CachingWordShaper slows it down significantly for CJK text. A benchmark of 2352 option elements with CJK text was slower than ASCII by 6-7 times. This patch improves it to 10-20% slower than ASCII on debug builds. There are other callers of HarfBuzzShaper with long text, such as paint, and this patch improves all of them for CJK scripts. BUG=565902, 570229 Committed: https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f Cr-Commit-Position: refs/heads/master@{#366421} ==========
Message was sent while issue was closed.
Patchset 9 (id:??) landed as https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f Cr-Commit-Position: refs/heads/master@{#366421}
Message was sent while issue was closed.
On 2015/12/21 at 18:27:05, commit-bot wrote: > Patchset 9 (id:??) landed as https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f > Cr-Commit-Position: refs/heads/master@{#366421} I think this caused windows failures here: https://code.google.com/p/chromium/issues/detail?id=571481 Rebaselining issue? Making as ignored in test expectations.
Message was sent while issue was closed.
Some other layout test text failures here, they seem CJK related https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac10.10/build... |