|
|
Created:
4 years, 2 months ago by Mircea Trofin Modified:
4 years, 2 months ago CC:
v8-reviews_googlegroups.com Target Ref:
refs/pending/heads/master Project:
v8 Visibility:
Public. |
Description[wasm] Fix wasm instantiation flakes
The spurious failures were caused by the compiled module
template and its corresponding owning object getting out of
sync due to memory allocations (which may trigger GC)
between the points each were fetched.
Specifically, the {original} was first obtained; then a GC
may happen when cloning the {code_table}. At this point,
the {original}'s owner may have been collected, getting us
down the path of not cloning. When time comes to patch up
globals, we incorrectly try to patch them assuming the
global start is at 0 (nullptr), which in fact it isn't.
This change roots early, in a GC-free area, both objects.
Additionally, it avoids publishing to the instances chain
the new instance until the very end. This way:
- the objects used to create the new instance offer a
consistent view
- the instances chain does not see the object we try to
form. If something fails, we can safely retry.
- since the owner is rooted, the state of the front of the
instances chain stays unchanged - with the same compiled
module we started from. So the early belief that we needed
to clone is not invalidated by any interspersed GC.
This situation suffers from a sub-optimality discussed in
the design document, in that, in a memory constrained
system, the following snippet may surprisingly fail:
var m = new WebAssembly.Module(...);
var i1 = new WebAssembly.Instance(m);
i1 = null;
var i2 = new WebAssembly.Instance(m); //may fail.
This will be addressed subsequently.
BUG=v8:5451
Committed: https://crrev.com/b75a0c4a555278de8c59695e55f26a4e2ea6c862
Cr-Commit-Position: refs/heads/master@{#40126}
Patch Set 1 #
Total comments: 5
Patch Set 2 : better fix #Patch Set 3 : better fix #
Messages
Total messages: 37 (26 generated)
The CQ bit was checked by mtrofin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== [wasm] fix flaky asm-wasm BUG= ========== to ========== [wasm] fix flaky asm-wasm The spurious failures were caused by collection of compiled module template between time we get its code object array and time we determine if it was owned. In turn, this meant that we didn't use the correct address for patching references to globals. I plan to make a more substantial change to improve the robustness of the instance management, by upfronting all "data collection" phases. For now, this change should remove the flake. BUG=v8:5451 ==========
mtrofin@chromium.org changed reviewers: + bradnelson@chromium.org, titzer@chromium.org
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== [wasm] fix flaky asm-wasm The spurious failures were caused by collection of compiled module template between time we get its code object array and time we determine if it was owned. In turn, this meant that we didn't use the correct address for patching references to globals. I plan to make a more substantial change to improve the robustness of the instance management, by upfronting all "data collection" phases. For now, this change should remove the flake. BUG=v8:5451 ========== to ========== [wasm] fix flaky asm-wasm The spurious failures were caused by finalization of compiled module template between time we get its code object array and time we determine if it was owned. In turn, this meant that we didn't use the correct address for patching references to globals. I plan to make a more substantial change to improve the robustness of the instance management, by upfronting all "data collection" phases. For now, this change should remove the flake. BUG=v8:5451 ==========
https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc File src/wasm/wasm-module.cc (right): https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... src/wasm/wasm-module.cc:1215: WeakCell* tmp = original->ptr_to_weak_owning_instance(); Why do we have a raw pointer here? https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... src/wasm/wasm-module.cc:1221: code_table = factory->CopyFixedArray(old_code_table); I don't understand how this fixes anything. As far as I can tell, it just moves an allocation until after the check.
https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc File src/wasm/wasm-module.cc (right): https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... src/wasm/wasm-module.cc:1215: WeakCell* tmp = original->ptr_to_weak_owning_instance(); On 2016/10/07 12:40:32, titzer wrote: > Why do we have a raw pointer here? Not sure, I think this is from 0xC; it's immaterial, though, it seems to be used for a check only and then we fetch the owner, neither are GC-ing. https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... src/wasm/wasm-module.cc:1221: code_table = factory->CopyFixedArray(old_code_table); On 2016/10/07 12:40:32, titzer wrote: > I don't understand how this fixes anything. As far as I can tell, it just moves > an allocation until after the check. Consider the old code. Suppose the allocation on line 1217 triggers GC for {original}'s owner. As a result, the weak link to the owner is reset by the InstanceFinalizer. We want to do that because that is how we then determine if a compiled module is owned or not. {owner} will be uninitialized, because we're not cloning. Now, when looking to relocate globals, we will believe the old globals address is nullptr (incorrect), and fail to patch the code. The change ensures we catch the owner before any of this happens. As I mentioned, I'm working on a more involved change, to avoid such inconsistencies "once and for all".
https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc File src/wasm/wasm-module.cc (right): https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... src/wasm/wasm-module.cc:1221: code_table = factory->CopyFixedArray(old_code_table); On 2016/10/07 14:44:23, Mircea Trofin wrote: > On 2016/10/07 12:40:32, titzer wrote: > > I don't understand how this fixes anything. As far as I can tell, it just > moves > > an allocation until after the check. > > Consider the old code. Suppose the allocation on line 1217 triggers GC for > {original}'s owner. As a result, the weak link to the owner is reset by the > InstanceFinalizer. We want to do that because that is how we then determine if a > compiled module is owned or not. > > {owner} will be uninitialized, because we're not cloning. > > Now, when looking to relocate globals, we will believe the old globals address > is nullptr (incorrect), and fail to patch the code. > > The change ensures we catch the owner before any of this happens. > > As I mentioned, I'm working on a more involved change, to avoid such > inconsistencies "once and for all". Ok, until then can we have a switch to turn off this optimization until it's fully stabilized? It's caused lots of gc-stress failures and resisted many attempts to fix it so far.
On 2016/10/07 15:37:42, titzer wrote: > https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc > File src/wasm/wasm-module.cc (right): > > https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... > src/wasm/wasm-module.cc:1221: code_table = > factory->CopyFixedArray(old_code_table); > On 2016/10/07 14:44:23, Mircea Trofin wrote: > > On 2016/10/07 12:40:32, titzer wrote: > > > I don't understand how this fixes anything. As far as I can tell, it just > > moves > > > an allocation until after the check. > > > > Consider the old code. Suppose the allocation on line 1217 triggers GC for > > {original}'s owner. As a result, the weak link to the owner is reset by the > > InstanceFinalizer. We want to do that because that is how we then determine if > a > > compiled module is owned or not. > > > > {owner} will be uninitialized, because we're not cloning. > > > > Now, when looking to relocate globals, we will believe the old globals address > > is nullptr (incorrect), and fail to patch the code. > > > > The change ensures we catch the owner before any of this happens. > > > > As I mentioned, I'm working on a more involved change, to avoid such > > inconsistencies "once and for all". > > Ok, until then can we have a switch to turn off this optimization until it's > fully stabilized? It's caused lots of gc-stress failures and resisted many > attempts to fix it so far. You mean, go back to having full GC before instantiating? Yes.
The CQ bit was checked by mtrofin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== [wasm] fix flaky asm-wasm The spurious failures were caused by finalization of compiled module template between time we get its code object array and time we determine if it was owned. In turn, this meant that we didn't use the correct address for patching references to globals. I plan to make a more substantial change to improve the robustness of the instance management, by upfronting all "data collection" phases. For now, this change should remove the flake. BUG=v8:5451 ========== to ========== [wasm] fix flaky asm-wasm The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ==========
Description was changed from ========== [wasm] fix flaky asm-wasm The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ========== to ========== [wasm] fix flaky asm-wasm The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtrofin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #2 (id:20001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtrofin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== [wasm] fix flaky asm-wasm The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ========== to ========== [wasm] Fix wasm instantiation flakes The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2016/10/07 15:49:43, Mircea Trofin wrote: > On 2016/10/07 15:37:42, titzer wrote: > > https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc > > File src/wasm/wasm-module.cc (right): > > > > > https://codereview.chromium.org/2395063002/diff/1/src/wasm/wasm-module.cc#new... > > src/wasm/wasm-module.cc:1221: code_table = > > factory->CopyFixedArray(old_code_table); > > On 2016/10/07 14:44:23, Mircea Trofin wrote: > > > On 2016/10/07 12:40:32, titzer wrote: > > > > I don't understand how this fixes anything. As far as I can tell, it just > > > moves > > > > an allocation until after the check. > > > > > > Consider the old code. Suppose the allocation on line 1217 triggers GC for > > > {original}'s owner. As a result, the weak link to the owner is reset by the > > > InstanceFinalizer. We want to do that because that is how we then determine > if > > a > > > compiled module is owned or not. > > > > > > {owner} will be uninitialized, because we're not cloning. > > > > > > Now, when looking to relocate globals, we will believe the old globals > address > > > is nullptr (incorrect), and fail to patch the code. > > > > > > The change ensures we catch the owner before any of this happens. > > > > > > As I mentioned, I'm working on a more involved change, to avoid such > > > inconsistencies "once and for all". > > > > Ok, until then can we have a switch to turn off this optimization until it's > > fully stabilized? It's caused lots of gc-stress failures and resisted many > > attempts to fix it so far. > > You mean, go back to having full GC before instantiating? Yes. I think we have to avoid doing an explicit GC.
Gentle reminder - ptal. I re-ran the stress tests a few times. All green, consistently, except for the arm64 timeout, which appears to be a shard over-capacity. Locally, the shard repro command passes.
lgtm
The CQ bit was checked by mtrofin@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== [wasm] Fix wasm instantiation flakes The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ========== to ========== [wasm] Fix wasm instantiation flakes The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ==========
Message was sent while issue was closed.
Committed patchset #3 (id:60001)
Message was sent while issue was closed.
Description was changed from ========== [wasm] Fix wasm instantiation flakes The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 ========== to ========== [wasm] Fix wasm instantiation flakes The spurious failures were caused by the compiled module template and its corresponding owning object getting out of sync due to memory allocations (which may trigger GC) between the points each were fetched. Specifically, the {original} was first obtained; then a GC may happen when cloning the {code_table}. At this point, the {original}'s owner may have been collected, getting us down the path of not cloning. When time comes to patch up globals, we incorrectly try to patch them assuming the global start is at 0 (nullptr), which in fact it isn't. This change roots early, in a GC-free area, both objects. Additionally, it avoids publishing to the instances chain the new instance until the very end. This way: - the objects used to create the new instance offer a consistent view - the instances chain does not see the object we try to form. If something fails, we can safely retry. - since the owner is rooted, the state of the front of the instances chain stays unchanged - with the same compiled module we started from. So the early belief that we needed to clone is not invalidated by any interspersed GC. This situation suffers from a sub-optimality discussed in the design document, in that, in a memory constrained system, the following snippet may surprisingly fail: var m = new WebAssembly.Module(...); var i1 = new WebAssembly.Instance(m); i1 = null; var i2 = new WebAssembly.Instance(m); //may fail. This will be addressed subsequently. BUG=v8:5451 Committed: https://crrev.com/b75a0c4a555278de8c59695e55f26a4e2ea6c862 Cr-Commit-Position: refs/heads/master@{#40126} ==========
Message was sent while issue was closed.
Patchset 3 (id:??) landed as https://crrev.com/b75a0c4a555278de8c59695e55f26a4e2ea6c862 Cr-Commit-Position: refs/heads/master@{#40126} |