Created: 4 years, 10 months ago by haraken
Modified: 4 years, 9 months ago
Reviewers: sof, jochen (gone - plz use gerrit), esprehn, Hannes Payer (out of office), Yuta Kitamura, oilpan-reviews
CC: chromium-reviews, oilpan-reviews, Mads Ager (chromium), blink-reviews, kinuko+watch, kouhei+heap_chromium.org
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description
Oilpan: Reduce the reserved size of CallbackStacks
CallbackStacks are used only while Oilpan's GC is doing marking & weak processing.
However, each CallbackStack currently retains one Block forever,
which wastes a lot of memory. Oilpan has 4 global CallbackStacks and 1 CallbackStack per thread.
Each Block consumes 8192 * sizeof(Item) = 128 KB. This means that Oilpan wastes 128 KB * (4 + # of threads).
When I start Chrome's new tab page, it creates 18 CallbackStacks,
meaning that it wastes 128 KB * 18 = 2.3 MB of memory.
This CL reduces the reserved size down to 350 KB.
BUG=
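
(For reference, a minimal sketch of the sizing arithmetic above; the Item layout here is an assumption standing in for the real CallbackStack::Item, and the thread count is taken from the new-tab-page observation in the description.)

#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for CallbackStack::Item: one object pointer plus one
// callback pointer, i.e. 16 bytes on a 64-bit build.
struct Item {
    void* object;
    void (*callback)(void*, void*);
};

int main() {
    const size_t blockEntries = 8192;                         // entries per Block
    const size_t blockBytes = blockEntries * sizeof(Item);    // 8192 * 16 = 128 KB
    const size_t stacks = 4 + 14;                             // 4 global + ~14 per-thread stacks = 18
    std::printf("per-block: %zu KB, total: %zu KB\n",
                blockBytes / 1024, stacks * blockBytes / 1024);
    // per-block: 128 KB, total: 2304 KB (~2.3 MB)
    return 0;
}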
Patch Set 1
Patch Set 2
Total comments: 1
Messages
Total messages: 27 (4 generated)
Description was changed from

==========
Oilpan: Reduce the reserved size of CallbackStacks

CallbackStacks are used only while Oilpan's GC is doing marking & weak processing.
However, currently each CallbackStack continues retaining one Block forever.
This wastes memory a lot. Oilpan has 4 global CallbackStacks and 1 CallbackStack per thread.
Each Block consumes 8192 * sizeof(Item) = 128 KB. This means that Oilpan wastes 128 KB * (4 + # of threads).
When I start Chrome's new tab page, it creates 18 CallbackStacks,
meaning that it wastes 128 KB * 18 = 2.3 MB of memory.
This CL reduces the reserved size down to MB.

BUG=
==========

to

==========
Oilpan: Reduce the reserved size of CallbackStacks

CallbackStacks are used only while Oilpan's GC is doing marking & weak processing.
However, currently each CallbackStack continues retaining one Block forever.
This wastes memory a lot. Oilpan has 4 global CallbackStacks and 1 CallbackStack per thread.
Each Block consumes 8192 * sizeof(Item) = 128 KB. This means that Oilpan wastes 128 KB * (4 + # of threads).
When I start Chrome's new tab page, it creates 18 CallbackStacks,
meaning that it wastes 128 KB * 18 = 2.3 MB of memory.
This CL reduces the reserved size down to 350 KB.

BUG=
==========
haraken@chromium.org changed reviewers: + oilpan-reviews@chromium.org, sigbjornf@opera.com
PTAL

(Another idea would be to create a pool for Blocks, but I'm not happy about introducing that complexity until we have data that justifies it.)
I would argue that it misses the point somewhat to view these stacks as "wasted" memory -- we don't normally consider a thread stack (8M?) to be a wasted pre-allocation. If any of these stacks are oversized with respect to their actual use, then we should reduce them.

https://codereview.chromium.org/1695653004/diff/20001/third_party/WebKit/Sour...
File third_party/WebKit/Source/platform/heap/Heap.cpp (right):

https://codereview.chromium.org/1695653004/diff/20001/third_party/WebKit/Sour...
third_party/WebKit/Source/platform/heap/Heap.cpp:161: s_markingStack = new CallbackStack(4096);
These are fairly arbitrary numbers; are they based on experiments showing that they are reasonable averages / N-percentiles of what's commonly seen?
On 2016/02/14 21:47:04, sof wrote:
> If any of these stacks are oversized with respect to their actual use, then we should reduce them.
> These are fairly arbitrary numbers; are they based on experiments showing that they are reasonable averages / N-percentiles of what's commonly seen?

As far as I can tell from running telemetry benchmarks, it is common for the marking stack to exceed the 4096 limit, or even the current 8192 limit. If we really want to make sure that the CallbackStack's allocation doesn't fail, we need to pre-allocate a fairly large amount of memory for the CallbackStacks (e.g., 4 MB per CallbackStack). However, that increases the memory pressure when a GC is not running. We should consider a way to minimize the pressure.

Pre-allocating big CallbackStacks would be the most straightforward solution to prevent a near-OOM GC from failing. However, there are other solutions that would prevent a near-OOM GC from failing while consuming less memory. For example:

- (If you're concerned about running out of address space rather than out of physical memory,) reserve 4 MB of address space per CallbackStack. Keep only the first 4096 entries resident and decommit the rest (see the sketch below).
- Create a pool of Blocks that can be shared by the CallbackStacks.
- Near-OOM scenarios must be handled by the MemoryPressureListener (V8 GCs and Oilpan GCs are triggered by the memory pressure notifications). We can make the MemoryPressureListener purge other memory before triggering V8 GCs and Oilpan GCs. The purging will then release memory which can be used by the Oilpan GCs. For example, just calling partitionPurgeMemory will release a bunch of memory. That said, as I mentioned before, the MemoryPressureListener is not yet fully working. This means that we don't yet even have an infrastructure for triggering Oilpan GCs in a near-OOM scenario (the tab-discarding team is now working on this). That is the problem we should solve first before worrying about allocation failure in the CallbackStacks -- IMHO, it seems premature to worry about the allocation failure before fixing the larger problem and getting data that shows the allocation failure is a problem in the real world.

FWIW, V8 pre-allocates only 256 KB of memory for its marking stack.
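
(A rough sketch of that first option, reserving address space and committing only the head, written with plain POSIX calls as a stand-in for whatever allocator primitives Chromium would actually use; names and sizes are illustrative.)

#include <sys/mman.h>
#include <cstddef>

const size_t kReservedBytes  = 4 * 1024 * 1024;  // 4 MB of address space
const size_t kCommittedBytes = 64 * 1024;        // ~4096 16-byte entries kept resident

struct ReservedStack {
    char* base = nullptr;
    size_t committed = 0;

    bool init() {
        // Reserve the full range without making it accessible (no commit).
        void* p = mmap(nullptr, kReservedBytes, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return false;
        base = static_cast<char*>(p);
        return commit(kCommittedBytes);
    }

    // Make the first |bytes| of the reservation usable.
    bool commit(size_t bytes) {
        if (mprotect(base, bytes, PROT_READ | PROT_WRITE) != 0)
            return false;
        committed = bytes;
        return true;
    }

    // After a GC, drop everything beyond the small head back to the OS.
    void shrinkAfterGC() {
        madvise(base + kCommittedBytes, committed - kCommittedBytes, MADV_DONTNEED);
        mprotect(base + kCommittedBytes, kReservedBytes - kCommittedBytes, PROT_NONE);
        committed = kCommittedBytes;
    }
};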
On 2016/02/15 01:28:06, haraken wrote:
> FWIW, V8 pre-allocates only 256 KB of memory for its marking stack.

I've been looking at that implementation to understand it better; it commits 256 KB, but reserves 4 MB for its marking deque, with the not-unusual bailout handling on marking stack overflow of just giving up and considering the rest marked & reachable rather than attempting to extend the stack further. Oilpan could do something similar to avoid allocations during marking.

Reducing the initial size of the ephemeron stack seems reasonable without upsetting stability.
On 2016/02/16 13:48:37, sof wrote:
> Reducing the initial size of the ephemeron stack seems reasonable without upsetting stability.

I want to reduce the size of the CallbackStacks by the next branch cut. In conclusion, what's your proposal?
On 2016/02/22 11:38:00, haraken wrote:
> I want to reduce the size of the CallbackStacks by the next branch cut. In conclusion, what's your proposal?

Reduce the ephemeron stack a bit; anything else wouldn't be wise in my view.
On 2016/02/22 11:39:57, sof wrote:
> Reduce the ephemeron stack a bit; anything else wouldn't be wise in my view.

What about s_postMarkingCallbackStack?

Reserving 2.3 MB of memory just for Oilpan is not acceptable anyway. If you really don't want to just reduce the size and collect UMAs, I can implement whatever complicated mechanism is needed (e.g., block pools, bailing out of a GC when the marking stack overflows, etc.) to reduce the size of the CallbackStacks, but what do you want exactly?
On 2016/02/22 11:52:30, haraken wrote:
> What about s_postMarkingCallbackStack?

I've found the depths of that stack to be considerable; have you investigated what's typically encountered?

> Reserving 2.3 MB of memory just for Oilpan is not acceptable anyway. If you really don't want to just reduce the size and collect UMAs, I can implement whatever complicated mechanism is needed (e.g., block pools, bailing out of a GC when the marking stack overflows, etc.) to reduce the size of the CallbackStacks, but what do you want exactly?

I don't want to compromise system stability; the size you quote compares favorably to what V8 reserves. I honestly don't see the merit of "hiding" a fixed reservation by re-doing it for every GC, which is what the proposals here reduce to, afaict. But I've said that before.

Should there be hard limits on stack sizes at which the marking process just bails out or tries to recover? It's worth considering, but that's extra complexity that would be mostly orthogonal to your direct concerns here.

Have you consulted the V8 GC people about the nature of the proposed change? Their judgment/input would be valuable to have, regardless of the conclusion.
On 2016/02/22 12:26:16, sof wrote:
> I don't want to compromise system stability; the size you quote compares favorably to what V8 reserves. I honestly don't see the merit of "hiding" a fixed reservation by re-doing it for every GC, which is what the proposals here reduce to, afaict. But I've said that before.

If you're worried about system stability, I'd argue the other way. On Android, a (foreground) renderer may be forcibly killed without receiving any pressure notification. If we waste 2.3 MB of memory, we increase the risk of that force-kill. If this CL doesn't cause any allocation failure in reality, then this CL is the better option. That's why I'm saying we should get data to see whether allocation failure becomes a real issue or not.

> Should there be hard limits on stack sizes at which the marking process just bails out or tries to recover? It's worth considering, but that's extra complexity that would be mostly orthogonal to your direct concerns here.
>
> Have you consulted the V8 GC people about the nature of the proposed change? Their judgment/input would be valuable to have, regardless of the conclusion.

I'll ask the V8 folks for advice.
haraken@chromium.org changed reviewers: + esprehn@chromium.org, hpayer@chromium.org, jochen@chromium.org
I discussed this with Jochen and Hannes.

- V8 reserves 4 MB of address space and commits 256 KB of physical memory for its marking stack. V8 has special logic for the case where the marking stack overflows during marking (as described in #6).

- In the long term, Oilpan should do the same thing. Oilpan should implement a memory pool shared by all CallbackStacks in the renderer process. The memory pool should reserve X MB of address space and Y KB of physical memory, and Oilpan should have logic for the case where the stack overflows.

- In the short term, we need some band-aid fix to reduce the 2.3 MB of physical memory (in M50). One possible fix would be just to reduce the default size of the CallbackStacks (i.e., this CL) and add RELEASE_ASSERTs to understand how often a GC fails at expanding the marking stack.

Any thoughts?
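
(A rough sketch of what that short-term RELEASE_ASSERT band-aid could look like; Block and allocateBlockOrCrash are illustrative names and the RELEASE_ASSERT stand-in is simplified, not the actual CallbackStack code in platform/heap.)

#include <cstddef>
#include <cstdlib>
#include <new>

// Stand-in for Blink's RELEASE_ASSERT (wtf/Assertions.h), which also crashes
// in release builds.
#define RELEASE_ASSERT(condition) \
    do { if (!(condition)) std::abort(); } while (0)

// Hypothetical Block type standing in for CallbackStack::Block; the real
// class looks different.
struct Block {
    explicit Block(size_t entryCount) : entryCount(entryCount) {}
    size_t entryCount;
};

// The band-aid idea: keep the small default reservation, but crash loudly if
// a GC ever fails to grow its marking stack, so crash reports show how often
// the small default is actually insufficient.
Block* allocateBlockOrCrash(size_t entryCount) {
    Block* block = new (std::nothrow) Block(entryCount);
    RELEASE_ASSERT(block);
    return block;
}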
On 2016/02/23 14:55:24, haraken wrote:
> - In the short term, we need some band-aid fix to reduce the 2.3 MB of physical memory (in M50). One possible fix would be just to reduce the default size of the CallbackStacks (i.e., this CL) and add RELEASE_ASSERTs to understand how often a GC fails at expanding the marking stack.
>
> Any thoughts?

Thanks for doing that. Dynamic allocations during marking are not something to rely on, or to do more of.

Unless we trust the OS's VMM to know how best to handle a periodically used area that's accessed in a stack-like fashion, having one block reserved for the CallbackStacks that's decommit-hinted upon GC finish is one thing to do (i.e., similar to what previously landed here). UMA recording would be preferable to release asserts if you want to gauge high-water marks on stack use?

(I don't understand what the term "physical memory" means in the above.)
On 2016/02/23 15:18:37, sof wrote:
> Unless we trust the OS's VMM to know how best to handle a periodically used area that's accessed in a stack-like fashion, having one block reserved for the CallbackStacks that's decommit-hinted upon GC finish is one thing to do (i.e., similar to what previously landed here). UMA recording would be preferable to release asserts if you want to gauge high-water marks on stack use?

I'm not sure how much we can rely on the OS (I suspect that if we explicitly say "decommit it!", the OS will just decommit it regardless of the historical access pattern), but I'm fine with trying the approach. Maybe we should measure the performance of the previous CL (https://codereview.chromium.org/1686943002/)?

> (I don't understand what the term "physical memory" means in the above.)

By "physical memory" I meant "committed memory".
haraken@chromium.org changed reviewers: + yutak@chromium.org
Yuta-san: Would you mind applying https://codereview.chromium.org/1686943002/ and measuring smoothness.tough_animation_cases on mobile? We're curious whether the CL regresses marking pause time (i.e., frame_times).

(We need to land some fix by Friday.)
On 2016/02/23 20:53:18, haraken wrote:
> Yuta-san: Would you mind applying https://codereview.chromium.org/1686943002/ and measuring smoothness.tough_animation_cases on mobile? We're curious whether the CL regresses marking pause time (i.e., frame_times).
>
> (We need to land some fix by Friday.)

I can see how it behaves on Windows tomorrow. The semantics of MADV_FREE fit best with the hinting I think you're after, which (still) means OS X only.
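
(For illustration, a small sketch of such a decommit hint; not Blink's actual code path, just the POSIX calls being referred to.)

#include <cstddef>
#include <sys/mman.h>

// Where MADV_FREE is available (OS X, and only much later Linux kernels), the
// kernel may reclaim the pages lazily under memory pressure; elsewhere
// MADV_DONTNEED drops them right away. Error handling omitted for brevity.
void decommitHint(void* start, size_t length) {
#if defined(MADV_FREE)
    madvise(start, length, MADV_FREE);
#else
    madvise(start, length, MADV_DONTNEED);
#endif
}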
On 2016/02/23 20:53:18, haraken wrote:
> Yuta-san: Would you mind applying https://codereview.chromium.org/1686943002/ and measuring smoothness.tough_animation_cases on mobile? We're curious whether the CL regresses marking pause time (i.e., frame_times).
>
> (We need to land some fix by Friday.)

Whoops, sorry, this message was buried in the flood of emails after a two-day offsite, and I've only just noticed it. I don't have time today, so I'll take a look tomorrow.
On 2016/02/24 09:51:17, Yuta Kitamura wrote:
> I don't have time today, so I'll take a look tomorrow.

Thanks!
Yuta-san: Did you get any results for the above? I need to land https://codereview.chromium.org/1686943002/ (or another CL if the performance isn't acceptable) by the end of today to push it to M50.
On 2016/02/25 09:22:23, haraken wrote:
> Yuta-san: Did you get any results for the above? I need to land https://codereview.chromium.org/1686943002/ (or another CL if the performance isn't acceptable) by the end of today to push it to M50.

Not yet. The measurement system has become quite unstable, and progress isn't good (< 10% right now). It's unlikely that I'll get meaningful results by the end of today.
On 2016/02/25 09:53:04, Yuta Kitamura wrote:
> Not yet. The measurement system has become quite unstable, and progress isn't good (< 10% right now). It's unlikely that I'll get meaningful results by the end of today.

OK, thanks. Then it looks better to land the CL and see what the perf bots say. I'll do that.
Something to consider: as Blink does most of its Oilpan allocation on the main thread, the thread-local callback stack reservation for other threads doesn't have to be equal to the main thread's.
On 2016/02/29 10:08:30, sof wrote:
> Something to consider: as Blink does most of its Oilpan allocation on the main thread, the thread-local callback stack reservation for other threads doesn't have to be equal to the main thread's.

What size do you prefer? Using CallbackStack::kMinimalBlockSize?

(As I said before, my preference is just removing the reservations and adding RELEASE_ASSERTs on OOM, but using CallbackStack::kMinimalBlockSize is also fine.)
On 2016/02/29 10:57:34, haraken wrote:
> What size do you prefer? Using CallbackStack::kMinimalBlockSize?
>
> (As I said before, my preference is just removing the reservations and adding RELEASE_ASSERTs on OOM, but using CallbackStack::kMinimalBlockSize is also fine.)

To evaluate an appropriate initial block size, you need to look at the use of weak collections + registerWeakMembers() off the main thread. That might be limited to LifecycleNotifier<>'s weak set for workers.
On 2016/02/29 12:22:43, sof wrote:
> To evaluate an appropriate initial block size, you need to look at the use of weak collections + registerWeakMembers() off the main thread. That might be limited to LifecycleNotifier<>'s weak set for workers.

https://codereview.chromium.org/1750553002/ does this.

With plenty of extra room, I'm now convinced.