|
|
DescriptionExtract JSON-LD metadata for Copyless Paste
Copyless paste is a feature which provides a 0-state suggestion for
text input, based on the user’s recent context. We'd like to augment
the signal by adding structured metadata to it besides just URL and the
title.
BUG=691765
Review-Url: https://codereview.chromium.org/2690903005
Cr-Commit-Position: refs/heads/master@{#451160}
Committed: https://chromium.googlesource.com/chromium/src/+/3b69be515fabeb3652f74ccf02c26c62bdd88bfb
Patch Set 1 #
Total comments: 10
Patch Set 2 : address comments #
Total comments: 5
Patch Set 3 : address dcheng's comments #
Total comments: 8
Patch Set 4 : address esprehn's comments #Patch Set 5 : rename to document_metadata #
Total comments: 2
Patch Set 6 : histogram #
Messages
Total messages: 55 (26 generated)
Description was changed from ========== Metadata extraction for Copyless Paste BUG= ========== to ========== Extract metadata for Copyless Paste BUG= ==========
Patchset #1 (id:1) has been deleted
wychen@chromium.org changed reviewers: + esprehn@chromium.org
Would you mind taking a look? Thanks!
The CQ bit was checked by wychen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Extract metadata for Copyless Paste BUG= ========== to ========== Extract JSON-LD metadata for Copyless Paste BUG=691765 ==========
esprehn@chromium.org changed reviewers: + dglazkov@chromium.org
Can we move the entire copyless paste feature into blink instead and just use mojo IPC, I'd like to avoid having to serialize into this massive string and expose this API that scans the entire document. https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:25: if (!element.hasTagName(HTMLNames::scriptTag)) This doesn't understand shadow DOM, also this is walking the entire document which I'm not sure we want to do on every page load. How will this get called? https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:28: element.hasAttribute(HTMLNames::typeAttr) && no need to check hasAttribute, this is just doing duplicate work with the below getAttribute. If it doesn't exist you'll get a nullAtom which won't be equal to the string https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:33: result.append(element.textContent(false)); false is the default, remove this https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:45: // var a=document.querySelectorAll("script[type=\"application/ld+json\"]"); Can we store these in a map instead of scanning the entire document? https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:61: return ""; emptyString
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Thanks for the prompt review! Quick reply first. https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:25: if (!element.hasTagName(HTMLNames::scriptTag)) On 2017/02/13 23:47:04, esprehn wrote: > This doesn't understand shadow DOM, also this is walking the entire document > which I'm not sure we want to do on every page load. How will this get called? Not understanding shadow DOM should be fine, since the JSON-LD tag should never be in one. This does walk the entire document for main frames on every page load though. I did think about the performance implication, and benchmarked this. On a typical news page on N5X, CopylessPaste.ExtractionUs is 0.07ms, which is shorter than WebCore.DistillabilityUs. In some sense, this works similarly to DocumentStatisticsCollector. They both walk the whole DOM tree, ignoring shadow DOM, but this one only gets called once per page load. https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:45: // var a=document.querySelectorAll("script[type=\"application/ld+json\"]"); On 2017/02/13 23:47:04, esprehn wrote: > Can we store these in a map instead of scanning the entire document? The JSON-LD data extracted here would be transient. After injecting to Icing, it can be discarded. Do you mean tightly integrating this extraction logic into the parser so that we don't have to scan the DOM again? I thought about that, but parser can have wider performance impact, while this only handles main frames, and is optional (some embedders might not need this feature).
No I mean something is going to call this method, that should just be code inside blink that either responds to a mojo IPC or is a service the renderer exposes. We shouldn't need to add this to WebDocument, blink should craft the mojo IPC directly and send it to the browser.
On 2017/02/14 00:30:22, esprehn wrote: > No I mean something is going to call this method, that should just be code > inside blink that either responds to a mojo IPC or is a service the renderer > exposes. We shouldn't need to add this to WebDocument, blink should craft the > mojo IPC directly and send it to the browser. Ah. I see. It is what I want to do in the next CL. It should be like this: On WebMeaningfulLayout::FinishedLoading, connect to Mojo service on the browser side if the JSON-LD extraction found something. On the Browser side, a service is bound to the main frame host, ideally from Java, to save from JNI string conversion. Do you think this makes sense? I'm also having difficulty finding existing code that does something similar, without V8 involved.
esprehn@chromium.org changed reviewers: + dcheng@chromium.org
I think inside LocalFrame::create() you'll want to use interfaceRegistry() to add a per frame service to expose this API to the browser process? dcheng@ can probably help there.
On 2017/02/14 00:48:32, esprehn wrote: > I think inside LocalFrame::create() you'll want to use interfaceRegistry() to > add a per frame service to expose this API to the browser process? dcheng@ can > probably help there. I'm thinking the other way around: the browser process hosts a server interface and wait for the renderer to feed it extracted metadata. Pushing it from WebKit might be more efficient.
https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:28: element.hasAttribute(HTMLNames::typeAttr) && On 2017/02/13 23:47:04, esprehn wrote: > no need to check hasAttribute, this is just doing duplicate work with the below > getAttribute. If it doesn't exist you'll get a nullAtom which won't be equal to > the string Done. https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:33: result.append(element.textContent(false)); On 2017/02/13 23:47:04, esprehn wrote: > false is the default, remove this Done. https://codereview.chromium.org/2690903005/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:61: return ""; On 2017/02/13 23:47:04, esprehn wrote: > emptyString Done.
How does this CL look as is? This CL only introduces the extraction logic. In the next CLs, the mojo server would be created on the Java side, and the client in Blink. Given that the scan only takes ~0.07ms, further optimization is not our top priority.
1) What is copyless paste? 2) If there are any more questions about Mojo, happy to chat about that. As it is, it looks like we can just use the InterfaceRegistry/InterfaceProvider on LocalFrame. (left some drive-by comments as well) https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp (right): https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp:21: kContent = Nit: put this in the initializers section of the ctor. https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp:41: String kContent; This isn't /really/ a constant, so it shouldn't be named with kStyleNaming. https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/OWNERS (right): https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/OWNERS:1: wychen@chromium.org Nit: ideally there should be more than one OWNER.
Description was changed from ========== Extract JSON-LD metadata for Copyless Paste BUG=691765 ========== to ========== Extract JSON-LD metadata for Copyless Paste Copyless paste is a feature which provides a 0-state suggestion for text input, based on the user’s recent context. We'd like to augment the signal by adding structured metadata to it besides just URL and the title. BUG=691765 ==========
Description was changed from ========== Extract JSON-LD metadata for Copyless Paste Copyless paste is a feature which provides a 0-state suggestion for text input, based on the user’s recent context. We'd like to augment the signal by adding structured metadata to it besides just URL and the title. BUG=691765 ========== to ========== Extract JSON-LD metadata for Copyless Paste Copyless paste is a feature which provides a 0-state suggestion for text input, based on the user’s recent context. We'd like to augment the signal by adding structured metadata to it besides just URL and the title. BUG=691765 ==========
Updated the description as well. https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp (right): https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp:21: kContent = On 2017/02/14 19:32:54, dcheng wrote: > Nit: put this in the initializers section of the ctor. Done. https://codereview.chromium.org/2690903005/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractorTest.cpp:41: String kContent; On 2017/02/14 19:32:54, dcheng wrote: > This isn't /really/ a constant, so it shouldn't be named with kStyleNaming. Done.
This is looking pretty good, thanks for all of the iteration. https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:27: if (element.hasTagName(HTMLNames::scriptTag) && no need to check hasTagName twice right? https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:42: // Do JSON-LD extraction, which is equivalant to the following JavaScript code: This belongs in the header file https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.h (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.h:15: class MODULES_EXPORT CopylessPasteExtractor final { Can you add a comment about what this class does and it's purpose and how to use it? https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/OWNERS (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/OWNERS:1: wychen@chromium.org You're not in any webkit owners today, how many patches have you landed?
When I see the name CopylessPasteExtractor, I am confused, because I can't guess the purpose of the code unless I understand the context behind the project. Would you consider naming this module + class to better reflect its purpose? I am not sure if it is "JsonLDMetadataExtractor" or something more elegant. Also, does it need to be a separate module?
https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:27: if (element.hasTagName(HTMLNames::scriptTag) && On 2017/02/15 00:10:03, esprehn wrote: > no need to check hasTagName twice right? Oops. https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.cpp:42: // Do JSON-LD extraction, which is equivalant to the following JavaScript code: On 2017/02/15 00:10:03, esprehn wrote: > This belongs in the header file Deleted. https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.h (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/CopylessPasteExtractor.h:15: class MODULES_EXPORT CopylessPasteExtractor final { On 2017/02/15 00:10:03, esprehn wrote: > Can you add a comment about what this class does and it's purpose and how to use > it? Done. https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/modules/copyless_paste/OWNERS (right): https://codereview.chromium.org/2690903005/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/modules/copyless_paste/OWNERS:1: wychen@chromium.org On 2017/02/15 00:10:03, esprehn wrote: > You're not in any webkit owners today, how many patches have you landed? You are right. I shouldn't be the owner.
On 2017/02/15 00:23:41, dglazkov wrote: > When I see the name CopylessPasteExtractor, I am confused, because I can't guess > the purpose of the code unless I understand the context behind the project. > Would you consider naming this module + class to better reflect its purpose? I > am not sure if it is "JsonLDMetadataExtractor" or something more elegant. I did ponder on naming for a while. JsonLDMetadataExtractor would be an accurate description of what it is right now, but in V2, we want to broaden it beyond just JSON-LD. Saying "metadata extractor" or "structured data extractor" seem a bit vague as well. Furthermore, this will not be a general-purposed Schema.org extractor. It'll just extract what's useful for Copyless Paste. Therefore, I ended up choosing the project name for it. > Also, does it need to be a separate module? I heard it's the new way to do things. :p More seriously, I put the DocumentStatisticsCollector in core/, but it's not used by any other core/ features. This CL has similar properties, so it doesn't feel right to put it in core/. The size of this CL is on the smaller side though. It'll grow bigger in V2.
On 2017/02/15 at 01:18:03, wychen wrote: > On 2017/02/15 00:23:41, dglazkov wrote: > > When I see the name CopylessPasteExtractor, I am confused, because I can't guess > > the purpose of the code unless I understand the context behind the project. > > Would you consider naming this module + class to better reflect its purpose? I > > am not sure if it is "JsonLDMetadataExtractor" or something more elegant. > > I did ponder on naming for a while. JsonLDMetadataExtractor would be an accurate description of what it is right now, but in V2, we want to broaden it beyond just JSON-LD. Saying "metadata extractor" or "structured data extractor" seem a bit vague as well. Furthermore, this will not be a general-purposed Schema.org extractor. It'll just extract what's useful for Copyless Paste. Therefore, I ended up choosing the project name for it. Ok. I defer to esprehn@ on this. > > > Also, does it need to be a separate module? > > I heard it's the new way to do things. :p > More seriously, I put the DocumentStatisticsCollector in core/, but it's not used by any other core/ features. This CL has similar properties, so it doesn't feel right to put it in core/. The size of this CL is on the smaller side though. It'll grow bigger in V2. Same here.
Maybe we should create a document_metadata module and put both of them there?
On 2017/02/15 01:45:43, esprehn wrote: > Maybe we should create a document_metadata module and put both of them there? Well, technically DocumentStatisticsCollector doesn't handle metadata. It's more like feature extraction in the machine learning sense. They both scan the whole DOM more or less similarly though, but putting them in the same module solely based on this is a bit strange.
On 2017/02/15 at 01:49:19, wychen wrote: > On 2017/02/15 01:45:43, esprehn wrote: > > Maybe we should create a document_metadata module and put both of them there? > > Well, technically DocumentStatisticsCollector doesn't handle metadata. It's more like feature extraction in the machine learning sense. They both scan the whole DOM more or less similarly though, but putting them in the same module solely based on this is a bit strange. Putting them into document_metadata sounds good to me. I agree it's stats, but all we need to do here is to expand the definition of metadata :)
On 2017/02/15 01:55:50, dglazkov wrote: > On 2017/02/15 at 01:49:19, wychen wrote: > > On 2017/02/15 01:45:43, esprehn wrote: > > > Maybe we should create a document_metadata module and put both of them > there? > > > > Well, technically DocumentStatisticsCollector doesn't handle metadata. It's > more like feature extraction in the machine learning sense. They both scan the > whole DOM more or less similarly though, but putting them in the same module > solely based on this is a bit strange. > > Putting them into document_metadata sounds good to me. I agree it's stats, but > all we need to do here is to expand the definition of metadata :) Sounds like an excellent excuse for me to put these two together. :) I plan to rename the files to something like Source/modules/document_metadata/CopylessPasteExtractor.cpp in this CL, and move DocumentStatisticsCollector into Source/modules/document_metadata in a separate CL. Sounds good?
It's renamed.
lgtm
The CQ bit was checked by wychen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
haraken@chromium.org changed reviewers: + haraken@chromium.org
(I haven't caught up the long discussion here, so sorry if this is already answered but) what's a reason you don't move the copyless-paste feature into Blink using Mojo? That way you can avoid sending a giant string from Blink to content/renderer/. https://codereview.chromium.org/2690903005/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/modules/document_metadata/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/modules/document_metadata/CopylessPasteExtractor.cpp:60: ("CopylessPaste.ExtractionUs", 1, 1000000, 50)); Have you already added the histogram to histograms.xml?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
wychen@chromium.org changed reviewers: + jwd@chromium.org
+jwd@ for histograms. https://codereview.chromium.org/2690903005/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/modules/document_metadata/CopylessPasteExtractor.cpp (right): https://codereview.chromium.org/2690903005/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/modules/document_metadata/CopylessPasteExtractor.cpp:60: ("CopylessPaste.ExtractionUs", 1, 1000000, 50)); On 2017/02/16 02:33:35, haraken wrote: > > Have you already added the histogram to histograms.xml? Oops. Thanks for your eagle eyes!
histograms lgtm
On 2017/02/16 02:33:36, haraken wrote: > (I haven't caught up the long discussion here, so sorry if this is already > answered but) what's a reason you don't move the copyless-paste feature into > Blink using Mojo? That way you can avoid sending a giant string from Blink to > content/renderer/. By "copyless-paste feature", do you mean the JSON-LD parsing logic and Icing interface? The JSON-LD parsing logic is already done here: https://chrome-internal-review.googlesource.com/c/319943/ AFAIK, the Icing interface is only available in Java, so we have to send to chrome/browser to the UI thread for Java.
The CQ bit was checked by wychen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by wychen@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dglazkov@chromium.org Link to the patchset: https://codereview.chromium.org/2690903005/#ps120001 (title: "histogram")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/02/16 18:28:27, wychen wrote: > On 2017/02/16 02:33:36, haraken wrote: > > (I haven't caught up the long discussion here, so sorry if this is already > > answered but) what's a reason you don't move the copyless-paste feature into > > Blink using Mojo? That way you can avoid sending a giant string from Blink to > > content/renderer/. > > By "copyless-paste feature", do you mean the JSON-LD parsing logic and Icing > interface? > > The JSON-LD parsing logic is already done here: > https://chrome-internal-review.googlesource.com/c/319943/ > > AFAIK, the Icing interface is only available in Java, so we have to send to > chrome/browser to the UI thread for Java. Makes sense. LGTM.
CQ is committing da patch. Bot data: {"patchset_id": 120001, "attempt_start_ts": 1487288022785210, "parent_rev": "a993acdd3083225868d472a30a2dc6d6c043c7dc", "commit_rev": "3b69be515fabeb3652f74ccf02c26c62bdd88bfb"}
Message was sent while issue was closed.
Description was changed from ========== Extract JSON-LD metadata for Copyless Paste Copyless paste is a feature which provides a 0-state suggestion for text input, based on the user’s recent context. We'd like to augment the signal by adding structured metadata to it besides just URL and the title. BUG=691765 ========== to ========== Extract JSON-LD metadata for Copyless Paste Copyless paste is a feature which provides a 0-state suggestion for text input, based on the user’s recent context. We'd like to augment the signal by adding structured metadata to it besides just URL and the title. BUG=691765 Review-Url: https://codereview.chromium.org/2690903005 Cr-Commit-Position: refs/heads/master@{#451160} Committed: https://chromium.googlesource.com/chromium/src/+/3b69be515fabeb3652f74ccf02c2... ==========
Message was sent while issue was closed.
Committed patchset #6 (id:120001) as https://chromium.googlesource.com/chromium/src/+/3b69be515fabeb3652f74ccf02c2... |