Created: 5 years, 7 months ago by Daniel Bratell
Modified: 5 years, 7 months ago
CC: blink-reviews, oilpan-reviews
Base URL: https://chromium.googlesource.com/chromium/blink.git@master
Target Ref: refs/heads/master
Project: blink
Visibility: Public.
Description
Shrinking EventPath to improve memory usage and performance
In bug 325460 EventPath was changed to have fewer allocations at the cost
of higher memory usage to improve performance. This is a followup that
adds one more allocation for lower memory usage and even higher (3-5%)
Event dispatching performance.
EventPath memory usage becomes an issue because, even though there is most of the
time only one active Event, slowness or garbage collection delays can leave dozens
or hundreds of Event objects in memory. In those cases, spending 1-2 KB per
EventPath ends up consuming 100-200 KB of memory.
The patch changes the EventPath builder to gather the path on the stack and
then build a perfectly sized EventPath. It seems to save about 1200 bytes per
event in my ad hoc testing (a typical event path depth is about 10-12), and
performance tests report:
Events/EventsDispatching 3.2% faster
Events/EventsDispatchingInDeeplyNestedShadowTrees 3.2% faster
Events/EventsDispatchingInShadowTrees 5.5% faster
(Confirmed through multiple runs)
Real world memory usage saving depends on the number of live Event objects. That
will vary.
R=hayato@chromium.org,sigbjornf@opera.com,rbyers@chromium.org,esprehn@chromium.org
BUG=487656, 325460
Committed: https://src.chromium.org/viewvc/blink?view=rev&revision=195651
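The approach described above (gather the path on the stack, then allocate exactly) can be sketched in standard C++. The names here (buildEventPath, and the Node and NodeEventContext stand-ins) are illustrative assumptions; the real patch works on WTF::Vector and Blink's DOM types:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Blink's Node and NodeEventContext; the real
// NodeEventContext is three pointers wide, which is what makes a
// 64-element inline buffer expensive.
struct Node { Node* parent = nullptr; };

struct NodeEventContext {
    Node* node;
    Node* target;
    Node* relatedTarget;
};

// Walk up from |start|, collecting the path into a fixed stack buffer,
// then build a heap vector of exactly the right size: no heap growth
// during the walk, and no oversized capacity kept alive afterwards.
std::vector<NodeEventContext> buildEventPath(Node* start) {
    std::array<Node*, 64> nodesInPath;  // stack buffer
    std::size_t depth = 0;
    for (Node* n = start; n && depth < nodesInPath.size(); n = n->parent)
        nodesInPath[depth++] = n;

    std::vector<NodeEventContext> path;
    path.reserve(depth);  // a single, perfectly sized allocation
    for (std::size_t i = 0; i < depth; ++i)
        path.push_back({nodesInPath[i], start, nullptr});
    return path;
}
```

With this shape, a 12-deep path costs one allocation of 12 * 24 = 288 bytes on 64-bit, matching the numbers discussed later in the review.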
Patch Set 1 #
Total comments: 4
Patch Set 2 : Addressed review comments. #
Patch Set 3 : Really addressed the review comments. #
Patch Set 4 : Oilpan performance (Member<> is not free of cost as raw pointers) #
Patch Set 5 : Added comment. #
Messages
Total messages: 33 (7 generated)
bratell@opera.com changed reviewers: + hayato@chromium.org, rbyers@chromium.org
Please take a look.
LGTM https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... File Source/core/events/EventPath.cpp (right): https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... Source/core/events/EventPath.cpp:132: ASSERT(m_nodeEventContexts.size() == 0); We don't need this ASSERT since L94 already asserts it.
sigbjornf@opera.com changed reviewers: + sigbjornf@opera.com
https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... File Source/core/events/EventPath.cpp (right): https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... Source/core/events/EventPath.cpp:97: Vector<Node*, 256> nodesInPath; Make this WillBeHeapVector<RawPtrWillBeMember<Node>>. And do you really need this big a local allocation -- what's performance like if you keep addNodeEventContext() like before, but do not keep a vector with an inline capacity in m_nodeEventContexts.
On 2015/05/19 07:16:19, sof wrote: > what's performance like if > you keep addNodeEventContext() like before, but do not keep a vector with an > inline capacity in m_nodeEventContexts. It loses 12-20% on the Event dispatching performance tests. > And do you really need this big a local allocation -- Since the idea is to avoid heap allocations I just made the stack buffer large enough. At least in a pre-oilpan world its size really doesn't matter as long as it won't make us run out of stack and it won't. Now I guess it could become ugly with Member<Node*> since that is no longer a raw pointer? What about keeping it a raw pointer since all elements we work on here are reachable from other places so one reference more or less won't add anything?
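The cost being discussed here (dropping the inline buffer and growing the heap vector element by element) can be sketched with a counting allocator. This is an assumption-laden illustration: std::vector stands in for WTF::Vector, whose growth policy differs in detail, but the reallocation pattern is the same.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Counting allocator: tracks how many backing-store blocks a container
// requests from the heap.
static int g_allocs = 0;

template <typename T>
struct CountingAlloc {
    using value_type = T;
    CountingAlloc() = default;
    template <typename U> CountingAlloc(const CountingAlloc<U>&) {}
    T* allocate(std::size_t n) {
        ++g_allocs;
        return static_cast<T*>(std::malloc(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { std::free(p); }
};
template <typename T, typename U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

// Growing element by element reallocates the backing store several times.
int pushWithoutReserve(int n) {
    g_allocs = 0;
    std::vector<int, CountingAlloc<int>> v;
    for (int i = 0; i < n; ++i) v.push_back(i);
    return g_allocs;
}

// One up-front, exactly sized reservation needs a single allocation.
int pushWithReserve(int n) {
    g_allocs = 0;
    std::vector<int, CountingAlloc<int>> v;
    v.reserve(n);
    for (int i = 0; i < n; ++i) v.push_back(i);
    return g_allocs;
}
```

This is the trade-off in the patch: the stack buffer absorbs the growth during the tree walk, and reserve-then-fill keeps the heap cost to one exact allocation.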
https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... File Source/core/events/EventPath.cpp (right): https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... Source/core/events/EventPath.cpp:97: Vector<Node*, 256> nodesInPath; On 2015/05/19 07:16:19, sof wrote: > Make this WillBeHeapVector<RawPtrWillBeMember<Node>>. > > And do you really need this big a local allocation -- what's performance like if > you keep addNodeEventContext() like before, but do not keep a vector with an > inline capacity in m_nodeEventContexts. Done. https://codereview.chromium.org/1136953011/diff/1/Source/core/events/EventPat... Source/core/events/EventPath.cpp:132: ASSERT(m_nodeEventContexts.size() == 0); On 2015/05/19 06:45:29, hayato wrote: > We don't need this ASSERT since L94 already asserts it. Done.
Patchset #4 (id:60001) has been deleted
On 2015/05/19 08:11:14, Daniel Bratell wrote: > Now I guess it could become ugly with Member<Node*> since that is no longer a > raw pointer? What about keeping it a raw pointer since all elements we work on > here are reachable from other places so one reference more or less won't add > anything? Yes, Member<Node*> makes the large stack vector slowish (loses 1% of the win). I've changed the Vector stack size from 256 to 64 to not make oilpan hurt but I would still prefer raw pointers. They are cheaper.
On 2015/05/19 08:19:10, Daniel Bratell wrote: > On 2015/05/19 08:11:14, Daniel Bratell wrote: > > > Now I guess it could become ugly with Member<Node*> since that is no longer a > > raw pointer? What about keeping it a raw pointer since all elements we work on > > here are reachable from other places so one reference more or less won't add > > anything? > > Yes, Member<Node*> makes the large stack vector slowish (loses 1% of the win). > I've changed the Vector stack size from 256 to 64 to not make oilpan hurt but I > would still prefer raw pointers. They are cheaper. I'm not sure you understand what Member<> expands to. EventDispatching is considerably faster with Oilpan enabled (~80%), so I wouldn't be too concerned about creating a temporary vector like this.
On 2015/05/19 08:22:37, sof wrote: > On 2015/05/19 08:19:10, Daniel Bratell wrote: > > would still prefer raw pointers. They are cheaper. > > I'm not sure you understand what Member<> expands to. I do. A class with constructors which means that Vector has to use loops that invoke that (inlined) constructor instead of doing nothing or using memset or memcpy. That is why Vector<Member<Foo>> becomes slower than Vector<Foo*>.
On 2015/05/19 08:51:04, Daniel Bratell wrote: > On 2015/05/19 08:22:37, sof wrote: > > On 2015/05/19 08:19:10, Daniel Bratell wrote: > > > would still prefer raw pointers. They are cheaper. > > > > I'm not sure you understand what Member<> expands to. > > I do. A class with constructors which means that Vector has to use loops that > invoke that (inlined) constructor instead of doing nothing or using memset or > memcpy. > > That is why Vector<Member<Foo>> becomes slower than Vector<Foo*>. Good, a shim around a raw pointer it merely is; done so as to keep track of these in a tidier manner. Backing stores aren't constructor initialized on allocation, so what loop would this be? It seems reasonable to assume that current-day compilers should be able to inline away a constructor that merely initialized its single field, but perhaps not?
On 2015/05/19 09:21:25, sof wrote: > On 2015/05/19 08:51:04, Daniel Bratell wrote: > > On 2015/05/19 08:22:37, sof wrote: > > > On 2015/05/19 08:19:10, Daniel Bratell wrote: > > > > would still prefer raw pointers. They are cheaper. > > > > > > I'm not sure you understand what Member<> expands to. > > > > I do. A class with constructors which means that Vector has to use loops that > > invoke that (inlined) constructor instead of doing nothing or using memset or > > memcpy. > > > > That is why Vector<Member<Foo>> becomes slower than Vector<Foo*>. > > Good, a shim around a raw pointer it merely is; done so as to keep track of > these in a tidier manner. Backing stores aren't constructor initialized on > allocation, so what loop would this be? > > It seems reasonable to assume that current-day compilers should be able to > inline away a constructor that merely initialized its single field, but perhaps > not? This is actually done on a higher level. In both STL containers and in WTF containers, the objects are checked with type trait APIs and depending on what the type trait APIs respond different sets of C++ code is chosen. Improving the type trait usage in WTF::Vector (which I did for footprint purposes) gave a 1% overall win in blink performance so that optimization is quite valuable. If you want to read some code, there is Source/wtf/VectorTraits.h which uses code in TypeTraits.h to inspect the types. There are already some hacks to optimize RefPtr and OwnPtr here and I think adding Member to that list might prove beneficial. I've filed http://crbug.com/489625 for it.
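The distinction being debated can be illustrated with standard type traits: wrapping a raw pointer in a class with a user-provided default constructor flips exactly the traits that vector implementations key off to pick their memset/memcpy fast paths. This Member is a hypothetical stand-in (the real Oilpan Member<T> does more, e.g. participates in tracing), and WTF inspects its own VectorTraits rather than <type_traits>, but the mechanism is the same:

```cpp
#include <type_traits>

struct Node;

// Hypothetical stand-in for Oilpan's Member<T>: a raw pointer plus a
// constructor that zero-initializes it.
template <typename T>
class Member {
public:
    Member() : m_ptr(nullptr) {}
    Member(T* p) : m_ptr(p) {}
    T* get() const { return m_ptr; }
private:
    T* m_ptr;
};

// A raw pointer is trivially default-constructible, so a vector of them
// can leave new slots uninitialized (or memset them in one pass).
static_assert(std::is_trivially_default_constructible<Node*>::value,
              "raw pointers hit the fast path");
// The wrapper is not: each new slot must run the constructor in a loop...
static_assert(!std::is_trivially_default_constructible<Member<Node>>::value,
              "the wrapper forces per-element construction");
// ...even though copying it is still a plain memcpy-able operation.
static_assert(std::is_trivially_copyable<Member<Node>>::value,
              "copies can still use memcpy");
```

Teaching the vector's traits machinery that such a wrapper is pointer-like is roughly what the crbug.com/489625 filed above proposes.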
Leaving it as 64 for now, makes sense - parity with how it was represented on EventPath previously. Looks fine wrt Oilpan. ( +Cc: esprehn )
Seems reasonable to me, LGTM Are there memory consumption tests running automatically that lock in the win here?
On 2015/05/19 19:09:32, Rick Byers wrote: > Seems reasonable to me, LGTM > > Are there memory consumption tests running automatically that lock in the win > here? I don't think so. There are generic memory usage tests but they all seem to be reactive, and I think lots of changes end up below the noise level. It's always been difficult to measure and track memory usage which is most likely a large factor in its steady growth. Our very best (most stable and reliable) memory usage results come from using special devices with no other programs and restarting the device after each run. Not scalable and with a long delay from code change to data. So long that mapping results back to code change is quite difficult.
esprehn@chromium.org changed reviewers: + esprehn@chromium.org
I'm not very comfortable landing this change, this seems like a hack around Oilpan's collector needing work. If I came across this code later I'd just undo your change. This Vector should just store OwnPtr's to NodeEventContext, these objects are too big to be storing the Vector directly.
On 2015/05/19 22:00:32, esprehn wrote: > I'm not very comfortable landing this change, this seems like a hack around > Oilpan's collector needing work. If I came across this code later I'd just undo > your change. > > This Vector should just store OwnPtr's to NodeEventContext, these objects are > too big to be storing the Vector directly. I'm a bit confused. Oilpan is not enabled for EventPath on trunk, so the performance improvement or the memory reduction mentioned in the CL description is not related to Oilpan, is it? BTW, by enabling Oilpan, the EventDispatching benchmarks are going to be much, much faster: blink_perf.events EventsDispatchingInDeeplyNestedShadowTrees runs/s GOOD +19.5% blink_perf.events EventsDispatchingInShadowTrees runs/s GOOD +21.6% blink_perf.events EventsDispatching runs/s GOOD +63.0% https://docs.google.com/spreadsheets/d/1R7PmKDvGgMO9rDsBMyCQMPwjroUm5QSWF7dVB...
I'm skeptical about your reasoning about the performance improvement and the memory reduction, but the change itself looks reasonable. LGTM.
Is event dispatch really that much faster with Oilpan? You're just deferring lots of work to a GC later, if you include the time to GC all the garbage is it still faster? I'm very skeptical about all these claims of better perf with Oilpan, how long is the giant GC after the benchmark is done?
On 2015/05/19 23:50:00, esprehn wrote: > Is event dispatch really that much faster with Oilpan? You're just deferring > lots of work to a GC later, if you include the time to GC all the garbage is it > still faster? I'm very skeptical about all these claims of better perf with > Oilpan, how long is the giant GC after the benchmark is done? The results are real -- the performance gain has been stable for a couple of months in the oilpan build. The performance gain mainly comes from the allocation scheme of Vector & HashTable backing storages we implemented in Oilpan's allocator. The allocation scheme is highly optimized so that:
- vector.expand and hashtable.expand don't need to reallocate backing storages in common cases.
- the memory of the backing storages is promptly freed and reused without involving a GC (Oilpan has a mechanism for coalescing the promptly-freed backing storages without involving a GC).
In those EventDispatching benchmarks, almost all backing storages are collected by the coalescing. No heavy GC happens as far as I can observe in about:tracing.
I'm confused:
esprehn: This has nothing to do with oilpan. All testing was done on trunk with oilpan disabled. The memory usage problems are on trunk. Whether there will be more or fewer parallel Event objects with oilpan I do not know, so I'm not sure what impact this has on an oilpan-enabled Blink. It should not make anything worse though. The code I'm fixing could probably look different, but it gave a 20% performance boost and that convinced the reviewer at the time that this was better than what preceded it.
haraken: "I'm skeptical about your reasoning about the performance improvement and the memory reduction..." - What makes you skeptical? The performance improvement numbers come from measurements with the performance benchmarks. Depending on platform, compiler, allocator and other factors the results may of course vary, but on a Release content_shell x86 built for profiling the results are consistently:
Events/EventDispatching Before: 815 ±12 runs/s. After: 846 ±3 runs/s.
Events/EventsDispatchingInDeeplyNestedShadowTrees: Before: 9410 ±10 runs/s. After: 9733 ±15 runs/s.
Events/EventsDispatchingInShadowTrees: Before: 20950 ±40 runs/s. After: 22125 ±40 runs/s.
The improvements in this particular build setup (content_shell x86_64 etc.) are fully repeatable (3.8% faster, 3.4% faster, 5.6% faster). I didn't do this change for the performance improvement, but if things get faster, I'll accept it. Gift horse and all that.
For the memory reduction numbers, the base data is this report from loading http://youtube.com/tv:
==26175== -------------------- 28 of 200 --------------------
==26175== max-live: 144,872 in 91 blocks
==26175== tot-alloc: 265,864 in 167 blocks (avg size 1592.00)
==26175== deaths: 148, at avg age 223,886,445 (0.75% of prog lifetime)
==26175== acc-ratios: 0.26 rd, 0.09 wr (69,672 b-read, 26,520 b-written)
==26175== at 0x4C2D92A: operator new(unsigned long)
==26175== by 0x134EC1D: blink::Event::initEventPath(blink::Node&)
==26175== by 0x134F122: blink::EventDispatcher::dispatchEvent(blink::Node&, WTF::PassRefPtr<blink::EventDispatchMediator>)
==26175== by 0x131B7A8: blink::Node::dispatchEvent(WTF::PassRefPtr<blink::Event>)
The interesting number here is "max-live": for some reason 91 EventPath objects were live at the same time, consuming a total of 144,872 bytes. Of all the data allocated (265,864 bytes), only 26% was read and 9% written, indicating a buffer that is much larger than actually needed. I ran some ad hoc mouse event tests on various pages (youtube, cnn, google.com, a Japanese tourist site) since I believe mouse events are the most common ones with non-trivial event paths. The deepest visible element I could hover was at depth 16, and most were at depth 8-12.
The number per object reported is 1592 bytes, which corresponds to the 64-element vector of objects containing 3 pointers each (64 * 3 * 8 is 1536). With this patch that will be 56 bytes plus the size of the custom allocation, which, assuming a depth of 12, will be 288 bytes plus heap overhead. That lands us at roughly 1200 fewer bytes allocated per object. Since the objects are smaller than one page (4 KB), we can assume that the unused memory is committed as well, so we save real memory and not just virtual memory.
The biggest question is how many Event objects will be live at any one time.
The data above shows that at one point on youtube there were 91 live Event objects, but considering that is a peak, most of the time the count will be lower. On the other hand, it is possible for a document to keep Events alive forever, and I assume that will also keep the EventPaths alive forever, so 91 is in no way an absolute limit. As for oilpan, that might change how many Event objects are alive at peak and on average as well.
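As a sanity check, the figures in this message are internally consistent. A quick verification, assuming 64-bit pointers and that the 56 bytes are the rest of the EventPath object outside the inline buffer:

```cpp
#include <cstddef>

// Checking the arithmetic in the report above (64-bit pointers assumed).
constexpr std::size_t kPtr = 8;
constexpr std::size_t kContext = 3 * kPtr;            // NodeEventContext: 3 pointers
constexpr std::size_t kInlineBacking = 64 * kContext; // Vector<NodeEventContext, 64>
static_assert(kInlineBacking == 1536, "matches '64 * 3 * 8 is 1536'");

// The reported 1592 bytes per block is the inline backing plus the
// remaining 56 bytes of EventPath members.
constexpr std::size_t kReportedBlock = 1592;
static_assert(kReportedBlock - kInlineBacking == 56, "the '56 bytes' figure");

// After the patch: 56 bytes plus an exactly sized backing store.
constexpr std::size_t kPatchedDepth12 = 56 + 12 * kContext;  // 344 bytes
static_assert(kReportedBlock - kPatchedDepth12 == 1248,
              "roughly 1200 fewer bytes per object, before heap overhead");
```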
Can we just use an array of ownptr instead? The bug here is that we create a vector of big objects. In general we don't do that in Blink for this reason and use pointers instead. I'd much rather have the few-line change than this unless you have numbers that show this is much better.
On 2015/05/20 09:07:03, esprehn wrote: > Can we just use an array of ownptr instead? The bug here is that we create > a vector of big objects. In general we don't do that in Blink for this > reason and use pointers instead. I'd much rather the few line change than > this unless you have numbers that shows this is much better. Sounds like you are really arguing to revert the fix for bug 325460 that landed last year? That one claimed this basic structure made event dispatching 20% faster than what preceded it through much fewer allocations and I have had no reason to doubt those claims. As for switching to Vector<OwnPtr<EventContext>> instead of Vector<EventContext>: EventContext is 3 pointers while OwnPtr is 1 pointer so they are not that different. A 12 element path will be 12 * heap overhead + 800 bytes with OwnPtr, 288 bytes with this patch (and 1500+ bytes with the current code). A 30 element path (30 being the deepest seen trees on normal sites) will be 30 * heap overhead + 1232 bytes with OwnPtr, 1080 bytes with this patch and still 1500+ bytes with the current code. All numbers for 64 bit, but since it's all pointers, 32 bit is just half of everything. I think this is a good improvement and that we should land it if we don't know of anything that will be worse with it. Again, oilpan will change things since it will make all these objects garbage collected by the oilpan garbage collector which means they will have different lifespans and I assume would be out. I believe they have to be oilpanified so that the garbage collector can find the leaf Node pointers but please don't quote me on that. Allocating the right size Vector should be equally valuable in oilpan as without oilpan though.
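The size comparison in this message can be checked with the same arithmetic. Assumptions in this sketch: 64-bit pointers, the OwnPtr variant keeps a 64-slot inline buffer of single pointers plus one heap block per element, and per-block heap overhead is left out (the message's 1080-byte figure for a 30-element path evidently includes overhead not modeled here):

```cpp
#include <cstddef>

// EventContext is three pointers; OwnPtr and raw pointers are one.
constexpr std::size_t kPtr = 8;
constexpr std::size_t kContext = 3 * kPtr;

// OwnPtr variant: 64 inline OwnPtr slots plus one heap object per element.
constexpr std::size_t ownPtrBytes(std::size_t depth) {
    return 64 * kPtr + depth * kContext;
}
// This patch: one backing store sized exactly to the path depth.
constexpr std::size_t patchedBytes(std::size_t depth) {
    return depth * kContext;
}

static_assert(ownPtrBytes(12) == 800, "12-element path with OwnPtr");
static_assert(ownPtrBytes(30) == 1232, "30-element path with OwnPtr");
static_assert(patchedBytes(12) == 288, "12-element path with this patch");
```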
esprehn@chromium.org changed reviewers: + haraken@chromium.org
lgtm, please add a comment to the code about why you do this. Someone is going to undo it unless there's an explanation for it.
On 2015/05/20 17:31:00, esprehn wrote: > lgtm, please add a comment to the code about why you do this. Someone is going > to undo it this unless there's an explanation for it. Hmm. I love explaining in the code why it is doing non-obvious things. 99% (possibly 100%) of the time when I do that, reviewers demand that I remove the comments because people can use git history. I absolutely don't mind adding comments, since I believe that too much code is non-obvious unless you wrote it yourself recently, but I wish more people asked me to add explanations instead of removing them.
On 2015/05/20 at 20:36:15, bratell wrote: > On 2015/05/20 17:31:00, esprehn wrote: > > lgtm, please add a comment to the code about why you do this. Someone is going > > to undo it this unless there's an explanation for it. > > Hmm. I love explaining in the code why it is doing non-obvious things. 99% (possibly 100%) of the times I do that reviewers demand that I remove the comments because people can use git history. I absolutely don't mind adding comments since I believe that too much code is non-obvious unless you wrote it yourself recently, but I wish more people asked me to add explanations instead of removing them. Non-obvious comments are fine, just writing what the code does is not. // Reduce memory bloat by sizing the Vector perfectly. OwnPtr in the Vector was a perf regression. is fine.
The CQ bit was checked by bratell@opera.com
The patchset sent to the CQ was uploaded after l-g-t-m from hayato@chromium.org, haraken@chromium.org, esprehn@chromium.org, rbyers@chromium.org Link to the patchset: https://codereview.chromium.org/1136953011/#ps100001 (title: "Added comment.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1136953011/100001
Message was sent while issue was closed.
Committed patchset #5 (id:100001) as https://src.chromium.org/viewvc/blink?view=rev&revision=195651