Issue 1525993005: Change CachingWordShapeIterator to delimit CJK characters

kojii

Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word ...

5 years ago (2015-12-16 03:24:46 UTC) #1

Description was changed from

==========
Change CachingWordShapeIterator to handle one CJK character as a word

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text by break opportunities, this
characteristic is not critical for the layout purpose.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
when it has many <option> elements with CJK text.

A benchmark of 2352 option elements with CJK text was slower than ASCII
by 6~7 times. This patch improves it to 10~20% slower on debug builds.

BUG=565902

patch from issue 1525963002 at patchset 60001
(http://crrev.com/1525963002#ps60001)
==========

to

==========
Change CachingWordShapeIterator to handle one CJK character as a word

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text by break opportunities, this
characteristic is not critical for the layout purpose.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
when it has many <option> elements with CJK text.

A benchmark of 2352 option elements with CJK text was slower than ASCII
by 6~7 times. This patch improves it to 10~20% slower on debug builds.

BUG=565902

patch from issue 1525963002 at patchset 60001
(http://crrev.com/1525963002#ps60001)
==========

kojii

kojii@chromium.org changed reviewers: + drott@chromium.org, eae@chromium.org

5 years ago (2015-12-17 05:06:48 UTC) #3

kojii

Still WIP, but script/font/fallback seemed to change and you might be interested in. From the ...

5 years ago (2015-12-17 05:06:49 UTC) #4

kojii

> 2. Some Emoji on Win/Mac fail to render as Emoji. Haven't figured out why. ...

5 years ago (2015-12-17 11:47:11 UTC) #5

kojii

Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word ...

5 years ago (2015-12-20 10:00:03 UTC) #7

Description was changed from

==========
Change CachingWordShapeIterator to handle one CJK character as a word

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text by break opportunities, this
characteristic is not critical for the layout purpose.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
when it has many <option> elements with CJK text.

A benchmark of 2352 option elements with CJK text was slower than ASCII
by 6~7 times. This patch improves it to 10~20% slower on debug builds.

BUG=565902

patch from issue 1525963002 at patchset 60001
(http://crrev.com/1525963002#ps60001)
==========

to

==========
Change CachingWordShapeIterator to handle one CJK character as a word

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance, and this patch improves such cases.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

kojii

Description was changed from ========== Change CachingWordShapeIterator to handle one CJK character as a word ...

5 years ago (2015-12-20 15:11:41 UTC) #8

Description was changed from

==========
Change CachingWordShapeIterator to handle one CJK character as a word

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance, and this patch improves such cases.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

to

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance, and this patch improves such cases.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

kojii

Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator ...

5 years ago (2015-12-20 15:14:20 UTC) #9

Description was changed from

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance, and this patch improves such cases.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

to

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance negatively.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

kojii

On 2015/12/20 16:14:32, kojii wrote: > eae@, PTAL. Forgot to note that this CL has ...

5 years ago (2015-12-20 16:18:39 UTC) #11

drott

Thanks for trying to improve the caching mechanism for CJK, Koji. It's a bit hard ...

5 years ago (2015-12-21 10:36:01 UTC) #12

kojii

https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h (right): https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h#newcode121 third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121: // Delimit every CJK character because these scripts do ...

5 years ago (2015-12-21 11:59:56 UTC) #13

https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou...
File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h
(right):

https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou...
third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121:
// Delimit every CJK character because these scripts do not delimit
On 2015/12/21 10:36:01, drott wrote:
> AFAICS we have now three modes in this file: Separate by spaces, separate for
> tabulation, and separate by CJK codepoints + combining marks. 

I think that's not the correct way to see the options; I'd say:
1. We have tabulation mode on/off
2. We have word caching on/off
3. We have script-based word segmentation, currently support two flavors by code
points.

1 and 2 are independent flags that can be combined any way.

> The segmentation logic is getting really hard to read and some member
variables
> are starting to be wrongly named and misleading. Here we're in
"nextWordRange()"
> but what is returned is not a word range but a CJK "extended" character. It's
> really hard to follow from bool next(RefPtr<ShapeResult>* wordResult) to how
the
> text actually gets split and I'd hope we can improve that.
> 
> For example in line 62 we make a decision on whether we can use caching and
use
> "m_shapeByWord". Can the logic for whether this is a CJK run be put there?
Could
> we perhaps make this into a decider which breaking logic to use? m_shapeByMode

> out of { Spaces, Tabulation, CJK characters } or so?

As above, that's not correct, because there are Tabulation + Spaces and
Tabulation + CJK.

I agree that the level of calls are deep. I can merge nextWordRangeForCJK() to
nextWordRnage() if you prefer; it used to be so until the patch before, but
separated thinking non-CJK might not want to see detailed CJK logic. But doing
so can decrease one level.

> It's hard to keep track of
> what is going on if the m_shapeByWord variable does not actually tell what
> breaking strategy is used.

I think that's conceptual misunderstanding. m_shapeByWord indicates whether word
caching is possible or not. Currently we don't set it properly for CJK, assuming
its low priority, but it may be false for CJK glyphs too in future.

When word caching is enabled, we need script-based segmentation. I'd like to
make it as minimum as possible, but as you know, word segmentation with correct
i18n support is very hard. For instance, this logic does not support South Asian
scripts yet. It's not raised as an important issue so I'm currently ignoring,
but just want your understanding that the current word segmentation is very
simple.

> Can we somehow reflect the breaking strategy be more clearly in the code, then
> use the breaking strategy right from the next() method, instead of branching
> from nextWord to nextWordRangeForCJK()?

As I said, I can merge the two functions. I'll try it in the next patch.

But I wish your understanding, "segmenting to word" **is** a very complex
problem, we're trying to implement its minimal subset here, but the minimal is
probably more complicated than we might want it to be when i18n is involved.

drott

On 2015/12/21 11:59:56, kojii wrote: > https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h > File third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h > (right): > > https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h#newcode121 ...

5 years ago (2015-12-21 12:26:03 UTC) #14

On 2015/12/21 11:59:56, kojii wrote:
>
https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou...
> File
third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h
> (right):
> 
>
https://codereview.chromium.org/1525993005/diff/140001/third_party/WebKit/Sou...
>
third_party/WebKit/Source/platform/fonts/shaping/CachingWordShapeIterator.h:121:
> // Delimit every CJK character because these scripts do not delimit
> On 2015/12/21 10:36:01, drott wrote:
> > AFAICS we have now three modes in this file: Separate by spaces, separate
for
> > tabulation, and separate by CJK codepoints + combining marks. 
> 
> I think that's not the correct way to see the options; I'd say:
> 1. We have tabulation mode on/off
> 2. We have word caching on/off
> 3. We have script-based word segmentation, currently support two flavors by
code
> points.

So, by 3) you mean: A splitting logic that is different per script? Okay - so
you would call the split-off CJK "units" "words" as well? That's what irritates
me a bit - since as far as I understand they are not words really, but rather
character sequences, potentially a part of a word?

> 1 and 2 are independent flags that can be combined any way.
> 
> > The segmentation logic is getting really hard to read and some member
> variables
> > are starting to be wrongly named and misleading. Here we're in
> "nextWordRange()"
> > but what is returned is not a word range but a CJK "extended" character.
It's
> > really hard to follow from bool next(RefPtr<ShapeResult>* wordResult) to how
> the
> > text actually gets split and I'd hope we can improve that.
> > 
> > For example in line 62 we make a decision on whether we can use caching and
> use
> > "m_shapeByWord". Can the logic for whether this is a CJK run be put there?
> Could
> > we perhaps make this into a decider which breaking logic to use?
m_shapeByMode
> 
> > out of { Spaces, Tabulation, CJK characters } or so?
> 
> As above, that's not correct, because there are Tabulation + Spaces and
> Tabulation + CJK.

Okay, thanks for the clarification.

> I agree that the level of calls are deep. I can merge nextWordRangeForCJK() to
> nextWordRnage() if you prefer; it used to be so until the patch before, but
> separated thinking non-CJK might not want to see detailed CJK logic. But doing
> so can decrease one level.
> 
> > It's hard to keep track of
> > what is going on if the m_shapeByWord variable does not actually tell what
> > breaking strategy is used.
> 
> I think that's conceptual misunderstanding. m_shapeByWord indicates whether
word
> caching is possible or not. Currently we don't set it properly for CJK,
assuming
> its low priority, but it may be false for CJK glyphs too in future.

The meaning of the flag is clear. Perhaps I didn't express myself well. IMO, the
naming could be improved, since in CJK the individual units are not really
"words" IIUC.  But I see what you mean regarding the combinations of modes.

> When word caching is enabled, we need script-based segmentation. I'd like to
> make it as minimum as possible, but as you know, word segmentation with
correct
> i18n support is very hard. For instance, this logic does not support South
Asian
> scripts yet. It's not raised as an important issue so I'm currently ignoring,
> but just want your understanding that the current word segmentation is very
> simple.

There is a well-tested ScriptRunIterator which we use for the shaping step, is
used in RunSegmenter.cpp to segment for shaping. - In fact, it would be great if
we have only script, small caps, and orientation resolved entries in the word
cache, which is the output of the RunSegmenter. The sub-runs returned by this
step are suitable for shaping. Would it make sense to move the caching below
this segmentation step?

> > Can we somehow reflect the breaking strategy be more clearly in the code,
then
> > use the breaking strategy right from the next() method, instead of branching
> > from nextWord to nextWordRangeForCJK()?
> 
> As I said, I can merge the two functions. I'll try it in the next patch.
> 
> But I wish your understanding, "segmenting to word" **is** a very complex
> problem, we're trying to implement its minimal subset here, but the minimal is
> probably more complicated than we might want it to be when i18n is involved.

Word separation is complex, yes - I am not sure what you mean by "but the
minimal is
 probably more complicated than we might want it to be when i18n is involved."

drott

LGTM, I find this clearer. (With a nit on adding a common-only test case). Sorry ...

5 years ago (2015-12-21 14:57:42 UTC) #15

kojii

Refactored in PS7 given your comment, along with the other CL. This time, the focus ...

5 years ago (2015-12-21 15:02:59 UTC) #16

Refactored in PS7 given your comment, along with the other CL. This time, the
focus is not to modular too much, but:
* Easier to read what nextWord() does.
* Avoid too deep functions when sharing the code does not give much values.
* Less diff for easier review.
Hope this aligns to your comments.

More replies follow inline:

On 2015/12/21 12:26:03, drott wrote:
> > 
> > I think that's not the correct way to see the options; I'd say:
> > 1. We have tabulation mode on/off
> > 2. We have word caching on/off
> > 3. We have script-based word segmentation, currently support two flavors by
> code
> > points.
> 
> So, by 3) you mean: A splitting logic that is different per script? Okay - so
> you would call the split-off CJK "units" "words" as well? That's what
irritates
> me a bit - since as far as I understand they are not words really, but rather
> character sequences, potentially a part of a word?

The "word" is ambiguous by definition. For example, when an input "Hello." was
given, this is a word for CachingWordShaper but is two words for the selection.

When performance is desired, defining a single CJK character as a word is widely
used technique. There are other variations to quickly determine a "word" for
CJK, but for this purpose, I think this is the best one.

True that it's not a "word" in morphological analysis, but we're defining what a
"word" is for CachingWordShaper, right? I'm just changing the definition to suit
better for our goal.

> The meaning of the flag is clear. Perhaps I didn't express myself well. IMO,
the
> naming could be improved, since in CJK the individual units are not really
> "words" IIUC.  But I see what you mean regarding the combinations of modes.

Call it a "unit" instead, since it's really not a "word" for Latin either? I
think it's more accurate, but is harder to understand for whoever does not need
that much details.

i18n-friendly "word" is defined accordingly to its purpose, and could have
multiple definitions within a single product. We don't rename "word-selection"
to "unit-selection". Unicode Text Segmentation[1] has "word" but it's never be a
morphological word for Asians. In these cases, "word" is higher-level concept
that mimics what a "word" is for Latin.

[1] http://unicode.org/reports/tr29/

> There is a well-tested ScriptRunIterator which we use for the shaping step, is
> used in RunSegmenter.cpp to segment for shaping. - In fact, it would be great
if
> we have only script, small caps, and orientation resolved entries in the word
> cache, which is the output of the RunSegmenter. The sub-runs returned by this
> step are suitable for shaping. Would it make sense to move the caching below
> this segmentation step?

Ah, that's a good idea. I haven't read RunSegmenter well enough to determine if
it's a good idea or not yet, but that's a good candidate to improve this
further.

Another idea I was thinking was to run the line breaker. Since the line breaker
is the heaviest user of width(), if all other inputs are aligned to the line
breaker, the cache should perform the best.

One more thing. When I was looking at test failures of early patches in this CL,
I noticed a design issue that, since the line breaker and paint gives different
runs to CachingWordShaper (line breaker splits at break opportunities,) we
determine different scripts/fallback fonts for the line breaker and paint. This
is probably the root cause of slight changes in pixels.

I haven't come up with how to solve this yet, hope to discuss more with you and
Emil, but we need to land this fix much earlier. As you might be aware of,
multiple people are asking me to revert the complex path due to this issue, and
this is currently beta-blocker. People complains that because of this, other
features are not getting enough dogfood.

Can we try to ease their pains first?

> > > Can we somehow reflect the breaking strategy be more clearly in the code,
> then
> > > use the breaking strategy right from the next() method, instead of
branching
> > > from nextWord to nextWordRangeForCJK()?
> > 
> > As I said, I can merge the two functions. I'll try it in the next patch.
> > 
> > But I wish your understanding, "segmenting to word" **is** a very complex
> > problem, we're trying to implement its minimal subset here, but the minimal
is
> > probably more complicated than we might want it to be when i18n is involved.
> 
> Word separation is complex, yes - I am not sure what you mean by "but the
> minimal is
> probably more complicated than we might want it to be when i18n is involved."

Not much, sorry, please disregard that part. I haven't come up a good plan for
Thai yet, wondering that might be an issue sometime in future and this code may
be a bit more complicated when it happens. But nothing concrete yet.

kojii

On 2015/12/21 14:57:42, drott wrote: > LGTM, I find this clearer. (With a nit on ...

5 years ago (2015-12-21 15:34:54 UTC) #17

drott

On 2015/12/21 15:02:59, kojii wrote: > Refactored in PS7 given your comment, along with the ...

5 years ago (2015-12-21 15:35:11 UTC) #18

On 2015/12/21 15:02:59, kojii wrote:
> Refactored in PS7 given your comment, along with the other CL. This time, the
> focus is not to modular too much, but:
> * Easier to read what nextWord() does.
> * Avoid too deep functions when sharing the code does not give much values.
> * Less diff for easier review.
> Hope this aligns to your comments.

Yes, thanks. What do you mean by "along with the other CL" - is this still
dependent on the other CL or is it all in one CL now?

> i18n-friendly "word" is defined accordingly to its purpose, and could have
> multiple definitions within a single product. We don't rename "word-selection"
> to "unit-selection". Unicode Text Segmentation[1] has "word" but it's never be
a
> morphological word for Asians. In these cases, "word" is higher-level concept
> that mimics what a "word" is for Latin.

Thanks for the clarification - yes, I am okay with calling it word in this
context, for lack of a better word :-) for it.

> > There is a well-tested ScriptRunIterator which we use for the shaping step,
is
> > used in RunSegmenter.cpp to segment for shaping. - In fact, it would be
great
> if
> > we have only script, small caps, and orientation resolved entries in the
word
> > cache, which is the output of the RunSegmenter. The sub-runs returned by
this
> > step are suitable for shaping. Would it make sense to move the caching below
> > this segmentation step?
> 
> Ah, that's a good idea. I haven't read RunSegmenter well enough to determine
if
> it's a good idea or not yet, but that's a good candidate to improve this
> further.
> 
> Another idea I was thinking was to run the line breaker. Since the line
breaker
> is the heaviest user of width(), if all other inputs are aligned to the line
> breaker, the cache should perform the best.

I don't think I understand this suggestion. Could you explain a bit more? We're
having the WordCache to make the line breaker's job easier. If you mean
something like a "line cache", I think we've seen that this is rather
ineffective in many cases (does not work well for re-flows, etc.) and it's
similar to the "run cache" we had before, which did not lead to the expected
performance improvements.

> One more thing. When I was looking at test failures of early patches in this
CL,
> I noticed a design issue that, since the line breaker and paint gives
different
> runs to CachingWordShaper (line breaker splits at break opportunities,) we
> determine different scripts/fallback fonts for the line breaker and paint.
This
> is probably the root cause of slight changes in pixels.

That sounds nasty, would you have an example for such a case? 

> I haven't come up with how to solve this yet, hope to discuss more with you
and
> Emil,

One of my thoughts is that it would help to have segmentation info higher up,
closer to TextRun and keep it there for all further text processing

>  ... but we need to land this fix much earlier. As you might be aware of,
> multiple people are asking me to revert the complex path due to this issue,
and
> this is currently beta-blocker. People complains that because of this, other
> features are not getting enough dogfood.


> Can we try to ease their pains first?

Yes, I think we can land this change. I'd like to hear eae@'s opinion as well,
but I think we can land this now and revert if he has concerns with this change.

kojii

On 2015/12/21 15:35:11, drott wrote: > On 2015/12/21 15:02:59, kojii wrote: > > Refactored in ...

5 years ago (2015-12-21 16:04:52 UTC) #19

On 2015/12/21 15:35:11, drott wrote:
> On 2015/12/21 15:02:59, kojii wrote:
> > Refactored in PS7 given your comment, along with the other CL. This time,
the
> > focus is not to modular too much, but:
> > * Easier to read what nextWord() does.
> > * Avoid too deep functions when sharing the code does not give much values.
> > * Less diff for easier review.
> > Hope this aligns to your comments.
> 
> Yes, thanks. What do you mean by "along with the other CL" - is this still
> dependent on the other CL or is it all in one CL now?

Thanks, I just found you reviewed the other one.

> > > There is a well-tested ScriptRunIterator which we use for the shaping
step,
> is
> > > used in RunSegmenter.cpp to segment for shaping. - In fact, it would be
> great
> > if
> > > we have only script, small caps, and orientation resolved entries in the
> word
> > > cache, which is the output of the RunSegmenter. The sub-runs returned by
> this
> > > step are suitable for shaping. Would it make sense to move the caching
below
> > > this segmentation step?
> > 
> > Ah, that's a good idea. I haven't read RunSegmenter well enough to determine
> if
> > it's a good idea or not yet, but that's a good candidate to improve this
> > further.
> > 
> > Another idea I was thinking was to run the line breaker. Since the line
> breaker
> > is the heaviest user of width(), if all other inputs are aligned to the line
> > breaker, the cache should perform the best.
> 
> I don't think I understand this suggestion. Could you explain a bit more?
We're
> having the WordCache to make the line breaker's job easier. If you mean
> something like a "line cache", I think we've seen that this is rather
> ineffective in many cases (does not work well for re-flows, etc.) and it's
> similar to the "run cache" we had before, which did not lead to the expected
> performance improvements.

No, sorry, I didn't mean that.

I was actually the same understanding -- when the line breaker performs well,
other usage of CachingWordShaper is low enough. This issue proved that it's not
the case. So one way to solve that is that, if all other uses of
CachingWordShaper pass the similar input as the line breaker, we can optimize us
for the line breaker and others should simply follow.

i.e., somehow split out the line breaker for a string, call it to break the
input string, and then give the result to CachingWordShaper. This can solve Thai
and some Indic scripts, but I'm not confident that's the best way or not yet.

> > One more thing. When I was looking at test failures of early patches in this
> CL,
> > I noticed a design issue that, since the line breaker and paint gives
> different
> > runs to CachingWordShaper (line breaker splits at break opportunities,) we
> > determine different scripts/fallback fonts for the line breaker and paint.
> This
> > is probably the root cause of slight changes in pixels.
> 
> That sounds nasty, would you have an example for such a case? 

Maybe these are simply bugs, but this CL is changing some paint results.

One case is that "13inch" where "inch" is Katakana, and no space between them.
Line breaker splits "13" and "inch", while paint gives the whole string. In this
case, the "inch" picks different font from other Katakana.
https://storage.googleapis.com/chromium-layout-test-archives/win_chromium_rel...

Another case is when the first CJK character of a line does not have a glyph,
all following CJK glyphs are also Tofu. This CL improves that.
https://storage.googleapis.com/chromium-layout-test-archives/linux_chromium_r...

Logically speaking, this could happen to Latin as well. "!" and "<" can break
between. See our line-break matrix here:
http://kojiishi.github.io/line-break-matrix/

It could be even worse when "word-break: break-all" is used.

I think we can summary to two issues, though not sure how much important they
are atm:
1. When line-breaker or any other callers split more text than we need to
determine scripts/fonts, we fail.
2. Since line-breaker and paints split text differently, we may calculate
differently for them.

> > I haven't come up with how to solve this yet, hope to discuss more with you
> and
> > Emil,
> 
> One of my thoughts is that it would help to have segmentation info higher up,
> closer to TextRun and keep it there for all further text processing

Agreed. I wish it be even higher than TextRun, but not sure where it should be
yet.

> >  ... but we need to land this fix much earlier. As you might be aware of,
> > multiple people are asking me to revert the complex path due to this issue,
> and
> > this is currently beta-blocker. People complains that because of this, other
> > features are not getting enough dogfood.
> 
> 
> > Can we try to ease their pains first?
> 
> Yes, I think we can land this change. I'd like to hear eae@'s opinion as well,
> but I think we can land this now and revert if he has concerns with this
change.

Thanks a lot!

drott

> Maybe these are simply bugs, but this CL is changing some paint results. > ...

5 years ago (2015-12-21 16:16:16 UTC) #20

kojii

The patchset sent to the CQ was uploaded after l-g-t-m from drott@chromium.org Link to the ...

5 years ago (2015-12-21 18:20:03 UTC) #22

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1525993005/200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1525993005/200001

5 years ago (2015-12-21 18:20:34 UTC) #23

commit-bot: I haz the power

Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator ...

5 years ago (2015-12-21 18:26:02 UTC) #24

Message was sent while issue was closed.

Description was changed from

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance negatively.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

to

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance negatively.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

commit-bot: I haz the power

Description was changed from ========== Change CachingWordShapeIterator to delimit CJK characters This patch changes CachingWordShapeIterator ...

5 years ago (2015-12-21 18:27:04 UTC) #26

Message was sent while issue was closed.

Description was changed from

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance negatively.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229
==========

to

==========
Change CachingWordShapeIterator to delimit CJK characters

This patch changes CachingWordShapeIterator to handle one CJK character
as a word for the caching purpose.

Since these scripts do not delimit words by spaces, text in these
scripts do not perform well with the CachingWordShaper. However, since
these scripts have break opportunities between almost every character,
and since the line breaker, the most performance critical user of
CachingWordShaper, splits text at break opportunities, this
characteristic is not critical for the layout purpose.

There are, however, other cases where this characteristic hits the
performance negatively.

The <select> element measures the width of all text in its <option>
elements to determine its width, and since each <option> usually has
unique text, this characteristic of CachingWordShaper slows it down
significantly for CJK text. A benchmark of 2352 option elements with
CJK text was slower than ASCII by 6-7 times. This patch improves it to
10-20% slower than ASCII on debug builds.

There are other callers of HarfBuzzShaper with long text, such as
paint, and this patch improves all of them for CJK scripts.

BUG=565902, 570229

Committed: https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f
Cr-Commit-Position: refs/heads/master@{#366421}
==========

commit-bot: I haz the power

Patchset 9 (id:??) landed as https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f Cr-Commit-Position: refs/heads/master@{#366421}

5 years ago (2015-12-21 18:27:05 UTC) #27

dmurph

On 2015/12/21 at 18:27:05, commit-bot wrote: > Patchset 9 (id:??) landed as https://crrev.com/1494aab723061f943998a92653c41b6f7f8fa01f > Cr-Commit-Position: ...

5 years ago (2015-12-21 22:16:30 UTC) #28

Noel Gordon

5 years ago (2015-12-21 22:46:44 UTC) #29

Message was sent while issue was closed.

Some other layout test text failures here, they seem CJK related

https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Mac10.10/build...

Issue 1525993005: Change CachingWordShapeIterator to delimit CJK characters (Closed)

Description

Patch Set 1 #

Patch Set 2 : Fix allowTabs case #

Patch Set 3 : Avoid split before COMMON/INHERITED #

Patch Set 4 : Some fixes, cleanup, and more tests #

Patch Set 5 : Minor fixes, more tests, and cleanup #

Patch Set 6 : TestExpectations #

Patch Set 7 : Refactor to clarify what nextWord does and less functions #

Patch Set 8 : Add test, rebase, TestExpectations #

Patch Set 9 : TestExpectations #

Messages