Description
Use TaskScheduler instead of WorkerPool in nss_util.cc.
The following traits are used:
Priority: Inherited (default)
The priority is inherited from the calling context (i.e. TaskTraits
are initialized with the priority of the current task).
Shutdown behavior: CONTINUE_ON_SHUTDOWN
Tasks posted with this mode which have not started executing before
shutdown is initiated will never run. Tasks with this mode running at
shutdown will be ignored (the worker will not be joined).
Note: Tasks that were previously posted to base::WorkerPool should
use this shutdown behavior because this is how base::WorkerPool
handles all its tasks.
May Block:
Tasks posted with MayBlock() may block. This includes but is not
limited to tasks that wait on synchronous file I/O operations:
read or write a file from disk, interact with a pipe or a socket,
rename or delete a file, enumerate files in a directory, etc. This
trait isn't required for the mere use of locks.
BUG=659191
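
For reference, a minimal sketch of what posting with these traits looked like under the TaskScheduler API of the time (the function and task names here are illustrative stand-ins, not the actual patch):

  // Sketch only: a post using the traits described above.
  #include "base/bind.h"
  #include "base/location.h"
  #include "base/task_scheduler/post_task.h"
  #include "base/task_scheduler/task_traits.h"

  void DoBlockingNSSInit() {
    // Synchronous, potentially blocking NSS/TPM initialization goes here.
  }

  void PostNSSInitTask() {
    base::PostTaskWithTraits(
        FROM_HERE,
        base::TaskTraits()
            .MayBlock()  // The task may wait on synchronous I/O.
            .WithShutdownBehavior(
                base::TaskShutdownBehavior::CONTINUE_ON_SHUTDOWN),
        base::Bind(&DoBlockingNSSInit));
    // No priority is specified, so it is inherited from the calling context.
  }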
Patch Set 1
Total comments: 1
Messages
Total messages: 24 (9 generated)
The CQ bit was checked by fdoray@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
fdoray@chromium.org changed reviewers: + davidben@chromium.org
PTAL. WorkerPool is being deprecated in favor of TaskScheduler.
davidben@chromium.org changed reviewers: + rsleevi@chromium.org
+rsleevi is probably a better person to review this.
This is one of those changes that Gab and I had discussed as benefiting from/needing worker pool, rather than switching to TaskScheduler. I don't believe this is a change we want, but I think it may be easier to chat over GVC to discuss.
On 2017/01/03 19:03:47, Ryan Sleevi wrote:
> This is one of those changes that Gab and I had discussed as benefiting
> from/needing worker pool, rather than switching to TaskScheduler.
>
> I don't believe this is a change we want, but I think it may be easier to chat
> over GVC to discuss.

I know you talked with Gab about crypto operations which need to run on a single thread.

However, this CL just migrates NSSInitSingleton::InitializeTPMTokenOnWorkerThread from WorkerPool to TaskScheduler. Both WorkerPool and TaskScheduler are dynamic thread pools. WorkerPool can run the tasks it receives concurrently, in any order. Tasks posted to TaskScheduler with base::PostTask*() can also run concurrently, in any order. However, TaskScheduler provides more advanced features: sequences of tasks, priorities, shutdown behaviors, single-threaded TaskRunners [1]...

If you really think that NSSInitSingleton::InitializeTPMTokenOnWorkerThread benefits from something that is provided by WorkerPool but not TaskScheduler, we can schedule a GVC meeting.

[1] TaskScheduler offers the CreateSingleThreadTaskRunner() function to run multiple tasks on the same thread, but it is not used in this CL.
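
To make the comparison concrete, here is a minimal sketch of the two posting APIs being discussed (the task body is a hypothetical stand-in; both calls post a parallel, unsequenced task):

  // Sketch only: old vs. new way to post the same parallel task.
  #include "base/bind.h"
  #include "base/location.h"
  #include "base/task_scheduler/post_task.h"
  #include "base/threading/worker_pool.h"

  void DoNSSWork() {}  // Hypothetical task body.

  void PostOldAndNew() {
    // Old: base::WorkerPool, an OS-style dynamic pool. Tasks may run
    // concurrently, in any order, and all effectively CONTINUE_ON_SHUTDOWN.
    base::WorkerPool::PostTask(FROM_HERE, base::Bind(&DoNSSWork),
                               true /* task_is_slow */);

    // New: TaskScheduler. Same parallel semantics, plus explicit traits.
    base::PostTaskWithTraits(
        FROM_HERE,
        base::TaskTraits().MayBlock().WithShutdownBehavior(
            base::TaskShutdownBehavior::CONTINUE_ON_SHUTDOWN),
        base::Bind(&DoNSSWork));
  }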
fdoray@chromium.org changed reviewers: + gab@chromium.org
On 2017/01/04 16:30:22, fdoray wrote:
> On 2017/01/03 19:03:47, Ryan Sleevi wrote:
> > This is one of those changes that Gab and I had discussed as benefiting
> > from/needing worker pool, rather than switching to TaskScheduler.
> >
> > I don't believe this is a change we want, but I think it may be easier to
> > chat over GVC to discuss.
>
> I know you talked with Gab about crypto operations which need to run on a
> single thread.
>
> However, this CL just migrates
> NSSInitSingleton::InitializeTPMTokenOnWorkerThread from WorkerPool to
> TaskScheduler. Both WorkerPool and TaskScheduler are dynamic thread pools.
> WorkerPool can run the tasks it receives concurrently, in any order. Tasks
> posted to TaskScheduler with base::PostTask*() can also run concurrently, in
> any order. However, TaskScheduler provides more advanced features: sequences
> of tasks, priorities, shutdown behaviors, single-threaded TaskRunners [1]...

Right, @rsleevi: this is different from all the work running on the IO thread which we discussed. The WorkerPool API is being deprecated in favor of TaskScheduler per the latter's API being a superset of the former (plus it allows Chrome to own more of its threads and as such be in a better position for scheduling). Is there anything WorkerPool gives you here that TaskScheduler doesn't?

> If you really think that NSSInitSingleton::InitializeTPMTokenOnWorkerThread
> benefits from something that is provided by WorkerPool but not TaskScheduler,
> we can schedule a GVC meeting.
>
> [1] TaskScheduler offers the CreateSingleThreadTaskRunner() function to run
> multiple tasks on the same thread, but it is not used in this CL.
On 2017/01/11 16:13:57, gab wrote:
> Right, @rsleevi: this is different from all the work running on the IO thread
> which we discussed. The WorkerPool API is being deprecated in favor of
> TaskScheduler per the latter's API being a superset of the former (plus it
> allows Chrome to own more of its threads and as such be in a better position
> for scheduling). Is there anything WorkerPool gives you here that
> TaskScheduler doesn't?

Yes, it guarantees the creation of distinct threads to run the task. That is, a task submitted to the WorkerPool will *always* start on a thread (whether reusing an existing one or starting a new one) within a turn of the pump. This ensures we don't have any systemic deadlocks due to multiple resources (which may include "surprising" resources, like the GPU tasks) being blocked on the NSS mutex.

Put differently, we don't want to 'queue' these messages. They need to start, and start independently, because they may block other, user-sensitive tasks from completing (as a consequence of the NSS global mutex).

base::WorkerPool was chosen over manually managing threads (e.g. base::Thread) because base::WorkerPool automatically handled any necessary shutdown semantics w/o joining.
On 2017/01/11 18:17:44, Ryan Sleevi wrote:
> Yes, it guarantees the creation of distinct threads to run the task. That is,
> a task submitted to the WorkerPool will *always* start on a thread (whether
> reusing an existing one or starting a new one) within a turn of the pump. This
> ensures we don't have any systemic deadlocks due to multiple resources (which
> may include "surprising" resources, like the GPU tasks) being blocked on the
> NSS mutex.
>
> Put differently, we don't want to 'queue' these messages. They need to start,
> and start independently, because they may block other, user-sensitive tasks
> from completing (as a consequence of the NSS global mutex).
>
> base::WorkerPool was chosen over manually managing threads (e.g. base::Thread)
> because base::WorkerPool automatically handled any necessary shutdown
> semantics w/o joining.

When using base::PostTask, the work is posted in parallel: it will be picked up by one of the TaskScheduler's worker threads regardless of other tasks (i.e. it's not sequenced with any other work (nor even with other tasks posted from that same callsite) and will be scheduled the second a worker picks it as its next task), so I don't see the difference. If you mean that the WorkerPool's thread count grows without bounds to accommodate more work: (1) that isn't quite true (it's capped at 256 threads on Windows), and (2) it's also not a desirable property; we'd much rather own our threads and decide when we're okay running one more task.

You get the same shutdown semantics from the TaskScheduler via the TaskShutdownBehavior::CONTINUE_ON_SHUTDOWN trait.
On 2017/01/11 18:57:38, gab wrote: > When using base::PostTask, the work is posted in parallel, it will be picked up > by one of the TaskScheduler's worker threads regardless of other tasks (i.e. > it's not sequenced with any other work (nor even other tasks posted by from that > same callsite) and will be scheduled the second a worker picks it as its next > task) so I don't see the difference. If you mean that the WorkerPool's thread > count grows without bounds to accommodate for more work this (1) isn't quite > true (i.e. it's capped at 256 threads on Windows) and this (2) is also not a > desirable property, we'd much rather own our threads and decide when we're okay > running one more task. Right, considering that the NSS issues only manifest on Linux, I'm ignoring (1), and suggesting that (2) is a desirable property on some dimensions - such as it ensures reasoning about what threads will do what. As an explanation of the convolution and why I'm uneasy with the migration (or more aptly, I'm uncertain that this isn't going to cause all sorts of explosions 3 weeks into a stable because it involves a user's smart card that we fundamentally cannot test with - which has happened, including affecting things like 1.1 million such smartcards used by citizens in a country to do all government interaction online...): - Chromium code grabs a lock - Chromium code calls an NSS function - NSS code internally locks - NSS sees smartcard needs authentication - NSS re-enters Chromium (via function pointer to handle smart card auth) - Chromium code needs to PostTask to the UI thread to do some work - UI thread needs to PostTask to the GPU thread to do some work - [God help me if the GPU thread needs the lock here - I think we shook out all those cases] - Additional tasks may be queued on any number of threads - Eventually UI posts back to the worker (if it's running a message loop) or tickles the semaphore (if it's blocking) - Chromium code finishes, returning control to NSS code - NSS code does the needful - NSS code unlocks and returns control to Chromium code - Chromium code unlocks Basically, whenever an NSS function does a WorkerPool PostTask, it's explicitly doing so to try to ensure that a new thread will be started to complete the work. That's the desired behaviour - because it then cuts out many of the scenarios where tasks are being starved because all worker threads are waiting on events that can only be processed when more tasks are drained, but cannot complete until those tasks drain. Starvation. There's a lot that can be said wrong about this, and I don't disagree that this is not the optimal position. But in the steady state it's historically represented little cost (versus the significant cost of the alternatives), except in the places where people suddenly violate these assumptions. And when those assumptions are violated, they have to be owned by //net folks to fix. So that's why I'm pushing back here and trying to see if we can't find a solution that allows those invariants, and if we can't, that we can quantify why and what the cost would be to allow that. I'm also trying to figure out if it's something you're fundamentally opposed to, or whether it's something that can be accepted as "not ideal, but acceptable"
On 2017/01/11 20:29:42, Ryan Sleevi wrote:
> Basically, whenever an NSS function does a WorkerPool PostTask, it's
> explicitly doing so to try to ensure that a new thread will be started to
> complete the work. That's the desired behaviour - because it then cuts out
> many of the scenarios where tasks are being starved because all worker threads
> are waiting on events that can only be processed when more tasks are drained,
> but cannot complete until those tasks drain. Starvation.

AFAIK, waiting on another thread is an anti-pattern in Chromium. I understand that it might be necessary sometimes, e.g. in a reentrant call on a random thread that needs to respond synchronously but requires state from a specific thread. It should however be so infrequent that having all worker threads blocked on work which is behind it in the pool's queue is extremely unlikely (if it did get into that state, Chrome would stall regardless of NSS).

The one thing we just discussed here offline that makes this more likely is that in the current design, one can request a SingleThreadTaskRunner for thread-affine work, and that binds a single worker thread as the only thread allowed to process that queue. Today that thread can also process other work when idle, and as such could unluckily pick up a task doing NSS which could re-enter and block for a response meant to be provided by the very thread being blocked. Based on this we just decided to tweak the TaskScheduler to give SingleThreadTaskRunners their very own (unshared) thread.

In that world, a parallel task posted to the TaskScheduler is guaranteed to be processed, and we can even give it .WithPriority(TaskPriority::USER_BLOCKING) if it must be processed as fast as possible. USER_BLOCKING tasks will never starve (if they somehow did, Chrome would have major issues much beyond NSS).

> I'm also trying to figure out if it's something you're fundamentally opposed
> to, or whether it's something that can be accepted as "not ideal, but
> acceptable".

We're trying to coalesce all PostTask APIs into TaskScheduler, and as such keeping WorkerPool around for a single use case is highly undesirable. We instead want to tweak TaskScheduler to handle all use cases. With the aforementioned tweak to the scheduler, would you be okay with this change?
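
A sketch of the two TaskScheduler mechanisms mentioned above, assuming the post_task.h API of that era (the exact helper spellings varied during this period, and the task body is a hypothetical stand-in):

  // Sketch only: thread-affine work vs. a high-priority parallel task.
  #include "base/bind.h"
  #include "base/location.h"
  #include "base/single_thread_task_runner.h"
  #include "base/task_scheduler/post_task.h"
  #include "base/task_scheduler/task_traits.h"

  void DoNSSWork() {}  // Hypothetical task body.

  void SchedulerExamples() {
    // Thread-affine work: a SingleThreadTaskRunner. Per the tweak discussed
    // above, its worker becomes dedicated (unshared), removing the hazard of
    // that thread picking up an NSS task that re-enters and blocks on itself.
    scoped_refptr<base::SingleThreadTaskRunner> runner =
        base::CreateSingleThreadTaskRunnerWithTraits(
            base::TaskTraits().MayBlock());
    runner->PostTask(FROM_HERE, base::Bind(&DoNSSWork));

    // A parallel task that must be picked up as fast as possible.
    base::PostTaskWithTraits(
        FROM_HERE,
        base::TaskTraits().MayBlock().WithPriority(
            base::TaskPriority::USER_BLOCKING),
        base::Bind(&DoNSSWork));
  }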
On 2017/01/11 21:29:10, gab wrote:
> AFAIK, waiting on another thread is an anti-pattern in Chromium.

It is, but it's required and extremely common when interacting with the OS or third-party libraries. Every OS API call can be seen as potentially blocking on another thread.

> I understand that it might be necessary sometimes, e.g. in a reentrant call on
> a random thread that needs to respond synchronously but requires state from a
> specific thread. It should however be so infrequent that having all worker
> threads blocked on work which is behind it in the pool's queue is extremely
> unlikely (if it did get into that state, Chrome would stall regardless of
> NSS).

At present, it would only stall if the critical threads were blocked - UI, GPU, IPC, etc. However, it's precisely to avoid this issue that we dispatch NSS calls to worker threads which automatically grow with the work.

> The one thing we just discussed here offline that makes this more likely is
> that in the current design, one can request a SingleThreadTaskRunner for
> thread-affine work, and that binds a single worker thread as the only thread
> allowed to process that queue. Today that thread can also process other work
> when idle, and as such could unluckily pick up a task doing NSS which could
> re-enter and block for a response meant to be provided by the very thread
> being blocked.

That is explicitly something we do not want to do here, as it significantly serializes tasks in a performance-degrading way. This isn't just gut feeling - we actively experimented with this and paid 20-100ms of latency depending on device and activity. For some Google internal resources, it added 30s to startup. And while these cases were influenced by secondary decisions and unfortunate design tradeoffs by Google services, it's unquestionable that it does happen.

> We're trying to coalesce all PostTask APIs into TaskScheduler, and as such
> keeping WorkerPool around for a single use case is highly undesirable. We
> instead want to tweak TaskScheduler to handle all use cases.

While I understand your goal, I don't think the solution is workable. Worst case, I feel like the answer is "Reimplement worker pool in //net so you can deprecate what you want, even though the functionality is still needed." Understandably, I don't think either of us sees that as particularly desirable, so I'm trying to see if there are options we can find. If not, though, I do believe that would be the right solution, versus using TaskScheduler.

> With the aforementioned tweak to the scheduler, would you be okay with this
> change?

No.
On 2017/01/11 21:36:35, Ryan Sleevi wrote:
> That is explicitly something we do not want to do here, as it significantly
> serializes tasks in a performance-degrading way. This isn't just gut feeling -
> we actively experimented with this and paid 20-100ms of latency depending on
> device and activity. For some Google internal resources, it added 30s to
> startup. And while these cases were influenced by secondary decisions and
> unfortunate design tradeoffs by Google services, it's unquestionable that it
> does happen.

I'm not sure what you mean by "this" and "it" here, but I don't think we're talking about the same thing..? And as I mentioned, processing other work on SingleThreadTaskRunners while idle is something we're looking to drop.

> While I understand your goal, I don't think the solution is workable. Worst
> case, I feel like the answer is "Reimplement worker pool in //net so you can
> deprecate what you want, even though the functionality is still needed."
> Understandably, I don't think either of us sees that as particularly
> desirable, so I'm trying to see if there are options we can find. If not,
> though, I do believe that would be the right solution, versus using
> TaskScheduler.

Let's settle on exactly what the problem is, though:

Why do you absolutely need another thread? To make sure it runs, right? We could provide you a TaskTraits::MustRun() trait or something that would create another thread for that task if all workers are busy (and there are already cases where we're considering going beyond |max_threads| in the scheduler's impl). But I'm not sure we need this if we can already guarantee you that your task will run..?

Can there be enough NSS tasks to flood the pool itself such that all workers would block on the NSS lock and the one with the lock would do a reentrant call that would also block on something in the queue? If not, I don't see what can preempt your task from running? And since they're all using the same NSS lock, parallelizing them also doesn't make much sense? We could sequence all NSS tasks to make sure to avoid this problem?

Worse comes to worst: base::Thread::Options now has a |joinable| field; could this simply use its own non-joinable thread?
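
For reference, a sketch of the non-joinable-thread option raised above, assuming the base::Thread::Options |joinable| field mentioned here (names are illustrative):

  // Sketch only: a dedicated, never-joined thread for NSS work.
  #include "base/bind.h"
  #include "base/location.h"
  #include "base/threading/thread.h"

  void DoNSSWork() {}  // Hypothetical task body.

  void StartNonJoinableNSSThread() {
    // Intentionally leaked: a non-joinable thread must outlive shutdown
    // without anyone ever joining it.
    base::Thread* thread = new base::Thread("nss_dedicated");
    base::Thread::Options options;
    options.joinable = false;  // Shutdown will not block waiting for it.
    thread->StartWithOptions(options);
    thread->task_runner()->PostTask(FROM_HERE, base::Bind(&DoNSSWork));
  }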
On 2017/01/11 21:55:05, gab wrote:
> I'm not sure what you mean by "this" and "it" here, but I don't think we're
> talking about the same thing..? And as I mentioned, processing other work on
> SingleThreadTaskRunners while idle is something we're looking to drop.

it = serializing NSS tasks through a bounded set of threads.

> Let's settle on exactly what the problem is, though:
>
> Why do you absolutely need another thread? To make sure it runs, right?

I'm not sure how to parse "to make sure it runs" (and your remarks below re: guarantee expand on that).

> We could provide you a TaskTraits::MustRun() trait or something that would
> create another thread for that task if all workers are busy (and there are
> already cases where we're considering going beyond |max_threads| in the
> scheduler's impl). But I'm not sure we need this if we can already guarantee
> you that your task will run..?

As with "guarantee", I'm not sure if you're suggesting "eventually" or "without regard to other tasks or pools". A different way to put it might be that I believe we *don't* want NSS to participate in any Lucky Luke prioritization or reordering, which may explain some of the disconnect.

> Can there be enough NSS tasks to flood the pool itself such that all workers
> would block on the NSS lock and the one with the lock would do a reentrant
> call that would also block on something in the queue?

This is the thing I'm afraid of, and which has happened in the past, and which is why we moved NSS tasks like this to the WorkerPool - to ensure they're consuming a different resource from 'any' other task in Chrome (since, at least on the platforms where it matters - Linux/ChromeOS - the worker pool will grow indefinitely and thus is inherently not in contention).

> If not, I don't see what can preempt your task from running? And since they're
> all using the same NSS lock, parallelizing them also doesn't make much sense?
> We could sequence all NSS tasks to make sure to avoid this problem?

While I grossly oversimplified with my remarks about the NSS lock, it's definitely the case that there are many locks in play here, and because it also involves third-party code (the smartcard), there may be non-NSS locks involved as well (on Windows, we've seen this with respect to GDI resources).

> Worse comes to worst: base::Thread::Options now has a |joinable| field; could
> this simply use its own non-joinable thread?

Maybe. We (//net) explored that around 6 years ago, both through base::Thread and base::PlatformThread. At the time, the issue was that significant parts of //base had dependencies on the base::AtExitManager (logging & message passing were the two I'm sure of), through the use of (non-leaky) Singletons and LazyInstances. The consequence of this was that even though we explicitly weren't joining the thread (such as through forced leaks), when the base::AtExitManager shut down, it was corrupting those structures and causing shutdown crashes. Obviously, a lot has changed (and been rewritten), and these dependencies may no longer exist. Or people replaced Singletons/LazyInstances with ::Leaky variations, and base::AtExitManager can (maybe, finally) die in a fire.

But that's part of the context on why we ended up using a base::SequencedWorkerPool (in the case of NSS sockets, since removed) for the socket operations - as the socket operations were at least sequenceable on a single thread by virtue of the fact that they were already sequenced on the IO thread - and base::WorkerPool for the non-socket NSS interactions (which itself was a result of originally using the same pool for both the socket and non-socket operations, but then finding out that the socket operations were deadlocking on the non-socket operations, due to the introduction of the GPU process, which depended on the IO thread).

Again, I want to stress that I'm not trying to defend it as a good design (the ways in which I hate third-party code have no end, especially the threading implications of NSS), and if the feeling is that this is an entirely unreasonable (or otherwise significant) code cost for y'all on the scheduler side to take care of, then I want to try to find out how we can make it easier. We're already working to make it easier for //net folks to understand (largely by excising NSS), but knowing that it's a bit of a delicate system re: threading, my goal in pushing back is mostly to make sure that in the process of replacing the jet engines while we're flying, people don't start trying to replace the landing gear as well - since we'll need that if we crash... OK, terrible metaphor. Put differently, I'm trying to minimize the scope of change/behaviour difference as much as possible, so that we don't have to firedrill if it blows up, rather than continuing work on replacing it so we no longer have to worry :)
Thanks for the extra context.

The issues with base::Thread you mention are no more: a non-joinable base::Thread works, and unless your own Run() implementation uses things that might go away during shutdown it'll be fine (and if it somehow doesn't, that's a P0 I'm happy to own). So that's at least one option.

But I'm still unsure why your task must run right away. I understand the reentrancy problem, but (1) you only ever post one of these tasks at a time (as highlighted in the comment below) and (2) the reentrancy problem is only present when the task is running, not while it's in the queue, so why is it more important to execute it right away (with respect to all other Chromium tasks)?

As for "participating in Lucky Luke prioritization and reordering", I'm not sure what you mean, but running on any thread is participating in some kind of prioritization (i.e. Lucky Luke's or the kernel's), and reordering-wise you're posting in parallel to WorkerPool, so you have no ordering guarantees already..? If your task is crucial to respond to user queries then you mark it with TaskPriority::USER_BLOCKING and it will run shortly (most likely sooner than if you created a new thread, in the vast majority of cases).

https://codereview.chromium.org/2610463002/diff/1/crypto/nss_util.cc
File crypto/nss_util.cc (right):

https://codereview.chromium.org/2610463002/diff/1/crypto/nss_util.cc#newcode346
crypto/nss_util.cc:346: DCHECK(!initializing_tpm_token_);
According to |initializing_tpm_token_| I'm led to believe you only ever have at most one pending InitializeTPMTokenOnWorkerThread() task.
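
A minimal sketch of what that guard implies (the flag name and the DCHECK come from the review comment; the surrounding class is invented for illustration):

  // Illustration only: |initializing_tpm_token_| ensures at most one pending
  // InitializeTPMTokenOnWorkerThread() task at a time.
  #include "base/logging.h"

  class TPMTokenInitGuard {
   public:
    void WillPostInitTask() {
      DCHECK(!initializing_tpm_token_);  // Never more than one in flight.
      initializing_tpm_token_ = true;
    }
    void OnInitTaskDone() { initializing_tpm_token_ = false; }

   private:
    bool initializing_tpm_token_ = false;
  };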
davidben@chromium.org changed reviewers: - davidben@chromium.org
On 2017/01/12 15:44:50, gab wrote:
> But I'm still unsure why your task must run right away. I understand the
> reentrancy problem, but (1) you only ever post one of these tasks at a time
> (as highlighted in the comment below) and (2) the reentrancy problem is only
> present when the task is running, not while it's in the queue, so why is it
> more important to execute it right away (with respect to all other Chromium
> tasks)?

Ah, perhaps the disconnect is that I'm discussing any NSS-using functions being dispatched to the worker pool if they cause the issue. That is, intentionally, rather than making a giant omnibus object that tried to manage all NSS interactions, individual places that run into this issue use base::WorkerPool. That is, we treat it like an OS library that should be 'understood' (for some value of understood), as it is part of the LSB, and instead treat it at the callsites.

Put differently, the task here is not just sequenced with initializing the TPM - it's sequenced with all other NSS tasks that need to be sequenced like this.

> As for "participating in Lucky Luke prioritization and reordering", I'm not
> sure what you mean, but running on any thread is participating in some kind of
> prioritization (i.e. Lucky Luke's or the kernel's), and reordering-wise you're
> posting in parallel to WorkerPool, so you have no ordering guarantees
> already..?

It seems my attempts to simplify the explanation (since I think we're largely just not on the same page yet) may be adding to the conflict. Here again, the concept I was trying to communicate is "We need this task to run, and never be sequenced against/waiting on any of the other NSS tasks that have been posted".

As I understand Lucky Luke, independent sequences may be reordered or deprioritized, while tasks within a sequence may block on an earlier task completing. Both are behaviours we don't want here.

> If your task is crucial to respond to user queries then you mark it with
> TaskPriority::USER_BLOCKING and it will run shortly (most likely sooner than
> if you created a new thread, in the vast majority of cases).

I'm afraid I'm not communicating well, because that's both not the issue here and still has the major downsides I've been talking about.
On 2017/01/12 19:11:46, Ryan Sleevi wrote:
> Put differently, the task here is not just sequenced with initializing the
> TPM - it's sequenced with all other NSS tasks that need to be sequenced like
> this.
>
> Here again, the concept I was trying to communicate is "We need this task to
> run, and never be sequenced against/waiting on any of the other NSS tasks that
> have been posted".
>
> As I understand Lucky Luke, independent sequences may be reordered or
> deprioritized, while tasks within a sequence may block on an earlier task
> completing. Both are behaviours we don't want here.

Ok, then I guess that, as much as I think we both dislike the current design, the only solution is to move WorkerPool (POSIX only) to //net once we're done migrating all other callsites, to keep the status quo for NSS tasks.

We'll be happy to revisit this design with you later if you wish. For example, we could provide a TaskTraits entry that guarantees a new thread for a task bearing it if all worker threads are currently busy (which is seldom the case, so overall you'd avoid creating new threads in the vast majority of cases). We're not tempted to provide that just yet, though, for fear of abuse while the API is brand new.

Thanks for your time going through this complicated issue.