|
|
Created:
5 years, 6 months ago by Changwan Ryu Modified:
5 years, 6 months ago CC:
chromium-reviews, darin-cc_chromium.org, jam Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionMake sure multi chracter codepoints are deleted correctly
deleteSurroundingText() only deletes one character even for multi-character
codepoint.
On the blink side, we have InputMethodController::extendSelectionAndDelete()
to make sure selection and deletion respect Unicode boundaries. However,
AdapterInputConnection keeps track of selection region separately, and this
value is incorrectly updated.
On top of adding a new test for this case, it extends
waitAndVerifyEditorCallback to also check the outbound calls to
InputMethodManager.
The above extension found that testEnterKeyEventWhileComposingText fails
because there is hidden discrepancy between blink implementation and
what we report to InputMethodManager. So I've added a TODO for that.
BUG=497091
Committed: https://crrev.com/5f6e036312bc4a978768f5b5971eee1a5ec9f272
Cr-Commit-Position: refs/heads/master@{#333704}
Patch Set 1 #Patch Set 2 : #Patch Set 3 : #
Total comments: 3
Patch Set 4 : addressed brian's comment and TODOs #Patch Set 5 : fixed findbug error #Patch Set 6 : do not check updateSelection for some of the tests #
Messages
Total messages: 28 (7 generated)
changwan@chromium.org changed reviewers: + aelias@chromium.org, aurimas@chromium.org, bcwhite@chromium.org
lgtm
LGTM https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... File content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java (right): https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java:362: public static int adjustLengthIfAtUtf16Boundary(CharSequence str, int index) { If it's just visible for testing, can you remove "public"? Presumably any tests will be in the same package and so will have visibility without the modifier.
Hmm, this fix doesn't seem like the right approach to me. First of all, the "adjustLengthIfAtUTF16Boundary" method feels like reinventing the wheel. There has got to be existing standard-library methods to do anything like this. If there isn't any similar standard-library method, then it's probably not a particularly legit thing to do with Unicode. More fundamentally, the emoji is only one character so the selection size should be only 1, not 2. Some part(s) of our system got confused and thought it's 2 characters instead of 1 because it happens to made out of 2 UTF-16 words. I think the backspace/selection code should all be thinking it's only acting on one character, and there should be some lower level charset/encoding layer that takes removing the 4 bytes from the underlying data structure, invisibly to the higher-level code.
"not lgtm" to clear out previous lgtm (particularly Aurimas's which was on a test-only variant of this patch)
On 2015/06/05 19:31:12, aelias wrote: > Hmm, this fix doesn't seem like the right approach to me. First of all, the > "adjustLengthIfAtUTF16Boundary" method feels like reinventing the wheel. There > has got to be existing standard-library methods to do anything like this. If > there isn't any similar standard-library method, then it's probably not a > particularly legit thing to do with Unicode. > Very similar functions are found in some libraries that we aren't importing: com.google.android.common.base#unicodePreservingIndex com.google.base.Strings#validSurrogatePairAt > More fundamentally, the emoji is only one character so the selection size should > be only 1, not 2. Some part(s) of our system got confused and thought it's 2 > characters instead of 1 because it happens to made out of 2 UTF-16 words. I > think the backspace/selection code should all be thinking it's only acting on > one character, and there should be some lower level charset/encoding layer that > takes removing the 4 bytes from the underlying data structure, invisibly to the > higher-level code. I might be wrong, but I think the selection size should be 2 because of the following observations: - All the basic implementations of CharSequence and CharSequenceIterator are in UTF-16 encoding, even though it is still theoretically possible that some third-party implementations may not be. So CharSequence.charAt() and size() all take the emoji as two characters. - Some functions in android.view.inputmethod.InputConnection explicitly mentions 'characters' which IMO refers to CharSequence characters. - ExtractedText has CharSequence and offsets. I think it would be confusing to IME developers if they cannot apply basic offset functions to the CharSequence text. I think we just need to be more careful at deletion and selection (maybe setComposingRegion should be fixed as well in the future).
On 2015/06/05 at 22:19:33, changwan wrote: > On 2015/06/05 19:31:12, aelias wrote: > > Hmm, this fix doesn't seem like the right approach to me. First of all, the > > "adjustLengthIfAtUTF16Boundary" method feels like reinventing the wheel. There > > has got to be existing standard-library methods to do anything like this. If > > there isn't any similar standard-library method, then it's probably not a > > particularly legit thing to do with Unicode. > > > > Very similar functions are found in some libraries that we aren't importing: > > com.google.android.common.base#unicodePreservingIndex > com.google.base.Strings#validSurrogatePairAt > Hmm, I'm only finding that unicodePreservingIndex in code related to the email app, is that actually in a base library? > > More fundamentally, the emoji is only one character so the selection size should > > be only 1, not 2. Some part(s) of our system got confused and thought it's 2 > > characters instead of 1 because it happens to made out of 2 UTF-16 words. I > > think the backspace/selection code should all be thinking it's only acting on > > one character, and there should be some lower level charset/encoding layer that > > takes removing the 4 bytes from the underlying data structure, invisibly to the > > higher-level code. > > I might be wrong, but I think the selection size should be 2 because of the following observations: > > - All the basic implementations of CharSequence and CharSequenceIterator are in UTF-16 encoding, > even though it is still theoretically possible that some third-party implementations may not be. > So CharSequence.charAt() and size() all take the emoji as two characters. > - Some functions in android.view.inputmethod.InputConnection explicitly mentions 'characters' > which IMO refers to CharSequence characters. > - ExtractedText has CharSequence and offsets. I think it would be confusing to IME developers if > they cannot apply basic offset functions to the CharSequence text. > > I think we just need to be more careful at deletion and selection (maybe setComposingRegion should be fixed as well in the future). Hmm, you have a point since https://docs.oracle.com/javase/7/docs/api/java/lang/CharSequence.html says each character "represents a character in the Basic Multilingual Plane (BMP) or a surrogate". That seems like an awful legacy of the 90s but I guess we're going to have to work with it. Note that another place says "In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding." In future discussions, let's use these terms instead of "character" when it's ambiguous. What happens in a TextView when emoji are inputted? Does it take up 2 spaces in the selection coordinates or 1?
On 2015/06/05 22:54:14, aelias wrote: > On 2015/06/05 at 22:19:33, changwan wrote: > > On 2015/06/05 19:31:12, aelias wrote: > > > Hmm, this fix doesn't seem like the right approach to me. First of all, the > > > "adjustLengthIfAtUTF16Boundary" method feels like reinventing the wheel. > There > > > has got to be existing standard-library methods to do anything like this. > If > > > there isn't any similar standard-library method, then it's probably not a > > > particularly legit thing to do with Unicode. > > > > > > > Very similar functions are found in some libraries that we aren't importing: > > > > com.google.android.common.base#unicodePreservingIndex > > com.google.base.Strings#validSurrogatePairAt > > > > Hmm, I'm only finding that unicodePreservingIndex in code related to the email > app, is that actually in a base library? Please check http://crbug.com/497091 > > > > More fundamentally, the emoji is only one character so the selection size > should > > > be only 1, not 2. Some part(s) of our system got confused and thought it's > 2 > > > characters instead of 1 because it happens to made out of 2 UTF-16 words. I > > > think the backspace/selection code should all be thinking it's only acting > on > > > one character, and there should be some lower level charset/encoding layer > that > > > takes removing the 4 bytes from the underlying data structure, invisibly to > the > > > higher-level code. > > > > I might be wrong, but I think the selection size should be 2 because of the > following observations: > > > > - All the basic implementations of CharSequence and CharSequenceIterator are > in UTF-16 encoding, > > even though it is still theoretically possible that some third-party > implementations may not be. > > So CharSequence.charAt() and size() all take the emoji as two characters. > > - Some functions in android.view.inputmethod.InputConnection explicitly > mentions 'characters' > > which IMO refers to CharSequence characters. > > - ExtractedText has CharSequence and offsets. I think it would be confusing to > IME developers if > > they cannot apply basic offset functions to the CharSequence text. > > > > I think we just need to be more careful at deletion and selection (maybe > setComposingRegion should be fixed as well in the future). > > Hmm, you have a point since > https://docs.oracle.com/javase/7/docs/api/java/lang/CharSequence.html says each > character "represents a character in the Basic Multilingual Plane (BMP) or a > surrogate". That seems like an awful legacy of the 90s but I guess we're going > to have to work with it. > > Note that another place says "In the Java SE API documentation, Unicode code > point is used for character values in the range between U+0000 and U+10FFFF, and > Unicode code unit is used for 16-bit char values that are code units of the > UTF-16 encoding." In future discussions, let's use these terms instead of > "character" when it's ambiguous. > > What happens in a TextView when emoji are inputted? Does it take up 2 spaces in > the selection coordinates or 1? I'll have to either hook up Android or a emoji enabled IME to figure this out. I'm starting to download Android source code.
On 2015/06/08 06:05:57, Changwan Ryu wrote: > On 2015/06/05 22:54:14, aelias wrote: > > On 2015/06/05 at 22:19:33, changwan wrote: > > > On 2015/06/05 19:31:12, aelias wrote: > > > > Hmm, this fix doesn't seem like the right approach to me. First of all, > the > > > > "adjustLengthIfAtUTF16Boundary" method feels like reinventing the wheel. > > There > > > > has got to be existing standard-library methods to do anything like this. > > If > > > > there isn't any similar standard-library method, then it's probably not a > > > > particularly legit thing to do with Unicode. > > > > > > > > > > Very similar functions are found in some libraries that we aren't importing: > > > > > > com.google.android.common.base#unicodePreservingIndex > > > com.google.base.Strings#validSurrogatePairAt > > > > > > > Hmm, I'm only finding that unicodePreservingIndex in code related to the email > > app, is that actually in a base library? > Please check http://crbug.com/497091 > > > > > > > More fundamentally, the emoji is only one character so the selection size > > should > > > > be only 1, not 2. Some part(s) of our system got confused and thought > it's > > 2 > > > > characters instead of 1 because it happens to made out of 2 UTF-16 words. > I > > > > think the backspace/selection code should all be thinking it's only acting > > on > > > > one character, and there should be some lower level charset/encoding layer > > that > > > > takes removing the 4 bytes from the underlying data structure, invisibly > to > > the > > > > higher-level code. > > > > > > I might be wrong, but I think the selection size should be 2 because of the > > following observations: > > > > > > - All the basic implementations of CharSequence and CharSequenceIterator are > > in UTF-16 encoding, > > > even though it is still theoretically possible that some third-party > > implementations may not be. > > > So CharSequence.charAt() and size() all take the emoji as two characters. > > > - Some functions in android.view.inputmethod.InputConnection explicitly > > mentions 'characters' > > > which IMO refers to CharSequence characters. > > > - ExtractedText has CharSequence and offsets. I think it would be confusing > to > > IME developers if > > > they cannot apply basic offset functions to the CharSequence text. > > > > > > I think we just need to be more careful at deletion and selection (maybe > > setComposingRegion should be fixed as well in the future). > > > > Hmm, you have a point since > > https://docs.oracle.com/javase/7/docs/api/java/lang/CharSequence.html says > each > > character "represents a character in the Basic Multilingual Plane (BMP) or a > > surrogate". That seems like an awful legacy of the 90s but I guess we're > going > > to have to work with it. > > > > Note that another place says "In the Java SE API documentation, Unicode code > > point is used for character values in the range between U+0000 and U+10FFFF, > and > > Unicode code unit is used for 16-bit char values that are code units of the > > UTF-16 encoding." In future discussions, let's use these terms instead of > > "character" when it's ambiguous. > > > > What happens in a TextView when emoji are inputted? Does it take up 2 spaces > in > > the selection coordinates or 1? > > I'll have to either hook up Android or a emoji enabled IME to figure this out. > I'm starting to download Android source code. Please check the crbug. Android's EditText also takes the emoji as two 'characters'.
OK, I suppose the approach is fine, just a minor comment below. Please also address your TODOs and bcwhite's comment. I'd also suggest playing around more with 32-bit emoji and search for more bugs while you're looking into this. There's a lot of methods that take indices into the selection, I wonder if the keyboard and Blink are both 100% rigorous in not sending down such invalid indices... https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... File content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java (right): https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java:362: public static int adjustLengthIfAtUtf16Boundary(CharSequence str, int index) { I'd think this would be a bit clearer/more reusable as "boolean indexBetweenUtf16SurrogatePair" and have the callers do: if (indexBetweenUtf16SurrogatePair(...)) beforeLength += 1;
PTAL I've extended waitAndVerifyEditableCallback to also check outbound calls to InputMethodManager, and removed some of the TODOs. However, I've found that there is some discrepancy between the text input state values and the selection values in testEnterKeyEventWhileComposingText, so I had to comment one line in this CL. I think many of the tests in ImeTest are somewhat flaky, though. I'll have to run git cl try to check whether the change introduces new test breakage. https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... File content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java (right): https://codereview.chromium.org/1165793007/diff/40001/content/public/android/... content/public/android/java/src/org/chromium/content/browser/input/AdapterInputConnection.java:362: public static int adjustLengthIfAtUtf16Boundary(CharSequence str, int index) { On 2015/06/09 03:52:53, aelias wrote: > I'd think this would be a bit clearer/more reusable as "boolean > indexBetweenUtf16SurrogatePair" and have the callers do: > > if (indexBetweenUtf16SurrogatePair(...)) > beforeLength += 1; Done.
lgtm
changwan@chromium.org changed reviewers: + tedchoc@chromium.org
Ted, could you review TestInputMethodManagerWrapper.java? Thanks.
On 2015/06/10 00:14:55, Changwan Ryu wrote: > Ted, could you review TestInputMethodManagerWrapper.java? Thanks. TestInputMethodManagerWrapper.java - lgtm I would personally just use a Pair<Integer, Integer> instead of writing the Range class, but I'm fine with it.
The CQ bit was checked by changwan@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from aurimas@chromium.org, bcwhite@chromium.org, aelias@chromium.org Link to the patchset: https://codereview.chromium.org/1165793007/#ps80001 (title: "fixed findbug error")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1165793007/80001
On 2015/06/10 00:18:17, Ted C wrote: > On 2015/06/10 00:14:55, Changwan Ryu wrote: > > Ted, could you review TestInputMethodManagerWrapper.java? Thanks. > > TestInputMethodManagerWrapper.java - lgtm > > I would personally just use a Pair<Integer, Integer> instead of > writing the Range class, but I'm fine with it. Thanks for the quick review. I wrote it for fanciness, but at the moment it's just the same as Pair<>. I'll keep that in mind and either try to use it globally or replace it with Pair<> if necessary.
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_android_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_android_r...)
The CQ bit was checked by changwan@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from bcwhite@chromium.org, tedchoc@chromium.org, aelias@chromium.org, aurimas@chromium.org Link to the patchset: https://codereview.chromium.org/1165793007/#ps100001 (title: "do not check updateSelection for some of the tests")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1165793007/100001
Message was sent while issue was closed.
Committed patchset #6 (id:100001)
Message was sent while issue was closed.
Patchset 6 (id:??) landed as https://crrev.com/5f6e036312bc4a978768f5b5971eee1a5ec9f272 Cr-Commit-Position: refs/heads/master@{#333704}
Message was sent while issue was closed.
A revert of this CL (patchset #6 id:100001) has been created in https://codereview.chromium.org/1173083004/ by dtrainor@chromium.org. The reason for reverting is: Failing Android Tests (dbg) builder: http://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%... The newly added test seems to be flaky.. |