Created: 9 years, 1 month ago by brettw
Modified: 8 years, 11 months ago
Reviewers: jar (doing other things)
CC: chromium-reviews, joi+watch-content_chromium.org, darin-cc_chromium.org, jam, michaeln, dpranke-watch+content_chromium.org
Base URL: svn://svn.chromium.org/chrome/trunk/src/
Visibility: Public.
Description
Add a sequenced worker pool.
This allows tasks to be put in the worker pool with optional sequencing semantics for consumers that must run a bunch of stuff in order on a background thread, but don't particularly care about which thread.
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=116078
Patch Set 1
Patch Set 2
Patch Set 3 (Total comments: 26)
Patch Set 4 (Total comments: 18)
Patch Set 5 (Total comments: 16)
Patch Set 6
Patch Set 7
Patch Set 8 (Total comments: 19)
Patch Set 9
Patch Set 10 (Total comments: 31)
Patch Set 11 (Total comments: 5)
Patch Set 12
Patch Set 13
Patch Set 14
Patch Set 15 (Total comments: 5)
Patch Set 16
Patch Set 17 (Total comments: 2)
Patch Set 18 (Total comments: 2)
Messages
Total messages: 58 (0 generated)
IMO, this whole activity cries out for using condition variables. They take care of so much of the lock work, it is amazing. Secondly, the way you're using locks here is philosophically very bad. It looks to me like you are locking code, rather than data. It is very hard to get code right when locks are used this way. Simply put, it is bad juju. ;-) That leads to lots of problems, not the least of which is that you do a lot of work under a lock, and that means that you'll tend to instigate contention. It is very easy to get deadlocks, and very hard to see them in such code :-(. With condition variables, the worker threads do the checking of the queue and pull things off to run, instead of having a scheduling thread do the work. Condition variables also lead to the better pattern of a) grabbing the lock, b) adjusting data, c) releasing the lock. If you want to see some examples of this stuff, condition_variable_unittest.cc has a bunch (to test the implementation). More specific comments below.

http://codereview.chromium.org/8416019/diff/3002/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:30: : thread_(StringPrintf("Browser worker %d", thread_number).c_str()),
Thread names are generally bouncy caps, with no spaces (sometimes hyphens). This is noticeable in about:tracking (the profiler). It is also interesting that line 30's naming convention produces an unbounded list of names (I noticed this as I was creating names for worker threads in the profiler). IMO, it is better to re-use names as they are recycled (some threads stop... then a bunch of new threads are started... and the names are recycled). Maybe this is just for debugging... and is a subtle detail for now. You might also just call them all "TokenizedWorkerThread" and not write the number into their name.

content/common/sequenced_worker_pool.cc:41: // This should only be called from within Inner's lock.
I couldn't grok the sentence. In general, locks should protect data, not code. It is better to define the association that way, and if you expect a lock to be held prior to a call (which manipulates said locked data), I find it helpful to include the word "Locked" somewhere in the method name.

content/common/sequenced_worker_pool.cc:123: size_t max_threads_;
I doubt that you can have a fixed number, unless it is very large.

content/common/sequenced_worker_pool.cc:155: // Created lazily when terminated_ is set and there are pending tasks, this
Why is this lazily created?

content/common/sequenced_worker_pool.cc:156: // is signaled by ScheduleWork whel all blocking tasks have completed.
nit: spelling: whel --> when

content/common/sequenced_worker_pool.cc:200: // for shutdown whenever the termianted_ flag is set.
Hmm... this suggests the semantics are to sacrifice order so that you get SHUTDOWN_BLOCK tasks to run.

content/common/sequenced_worker_pool.cc:211: i = pending_tasks_.erase(i);
IF the erase() can do anything, this is a very dangerous call. You're holding a lock, and calling an arbitrary method. At best I think you should pull out all the tasks you *plan* on erasing, release the lock, and then delete them.

content/common/sequenced_worker_pool.cc:236: UMA_HISTOGRAM_TIMES("SequencedWorkerPool.ShutdownDelayTime",
Histograms will often be useless at shutdown time. The problem is that we have to gather the data up and persist it, and *that* may be what we're waiting for as the last task :-/.

content/common/sequenced_worker_pool.cc:243: testing_observer_ = observer;
nit: A DCHECK that this was not already holding a pointer would be good.

content/common/sequenced_worker_pool.cc:264: return; // No work to schedule or no threads to schedule them on.
I don't understand why you returned here. Is this the case where we've scheduled all the important tasks... and we notice that they are using up all our threads? Why do we want to eject at that point?

content/common/sequenced_worker_pool.cc:270: // This algorithm is simple and fair, but inefficient in some cases. For
I see what you're going for now. You want to prioritize the task that has been waiting the longest, and give it the first open thread. OK... interesting.

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:24: // FROM_HERE, base::Bind(...));
Although I have yet to figure out how to do it, I *wish* we didn't have to see the "FROM_HERE" macro. As a result, I put it first in the argument list, so that: a) it blends with the stuff you don't have to read; b) if we ever figure out a way to make this better, we can search and replace the Post...(FROM_HERE with something better. Bottom line: I'd humbly request FROM_HERE be listed as the first arg. This will put less of this garbage in the middle of the "real" arguments which the author has to get right.

content/common/sequenced_worker_pool.h:34: // BLOCK_SHUTDOWN behavior, you must call Shutdown() which will wait until
I think we want most folks to not block shutdown. As a result, we should make it mean and obvious when they are blocking (or going to block) shutdown. For example, perhaps we need a long name like: SequencedWorkerPool::REALLY_REALLY_REALLY_HANG_THE_BROWSER_UNTIL_DONE. I want a code reviewer reading this to be VERY aware that the code is insisting on this behavior.

content/common/sequenced_worker_pool.h:66: SKIP_ON_SHUTDOWN,
This is a confusing distinction with CONTINUE_ON_SHUTDOWN. When you say that "tasks already in progress" will be completed, do you mean you will join the thread? Wait for completion before teardown? I guess you're going after something subtle: that some tasks, once started, become shutdown blockers. If they don't start, then they are not blockers, eh?

content/common/sequenced_worker_pool.h:70: // sparingly.
What happens if a BLOCK_SHUTDOWN task appears in line (with a given token) after less aggressive (SKIP_ON_SHUTDOWN, etc.) tasks with the same token? Do you sacrifice the ordering rule (discard the less important tasks, and run the shutdown blocker first), or do you sacrifice the meaning of CONTINUE_ON_SHUTDOWN and SKIP_ON_SHUTDOWN, and make the browser wait for all of them? IMO, this suggests that a given token should be tagged with a completion enum, rather than an individual task. What am I missing?

content/common/sequenced_worker_pool.h:77: // fail (return false) and the task will be deleted.
Hmm... that sounds like a confusing or racy definition. If you get into a queue, then you get to run... This might work... but I'm a little skeptical about racing.

content/common/sequenced_worker_pool.h:93: int id_;
This probably needs to be an atomic, or more likely, you need a lock to protect access when incrementing.

content/common/sequenced_worker_pool.h:106: // PostSequencedWorkerTask(). Valid tokens are alwys nonzero.
nit: alwys

content/common/sequenced_worker_pool.h:127: // shutdown regardless of the specified ShutdownBehavior.
I guess it is good that you have weasel words. Can we do better though?
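To make the line-211 suggestion concrete, here is a minimal standalone sketch of "collect the tasks you plan on erasing under the lock, then let them be destroyed after the lock is released." It uses std:: primitives and made-up names (InnerSketch, PendingTask) purely for illustration; the actual patch uses base::Lock and base::Closure, not these types.

```cpp
#include <list>
#include <mutex>

struct PendingTask {
  // In the real code this holds a closure plus sequencing info; here it is
  // just a placeholder whose destructor might do arbitrary, slow cleanup.
  ~PendingTask() {}
};

class InnerSketch {
 public:
  void DiscardSkippableTasks() {
    std::list<PendingTask> doomed;  // Destroyed at function exit, *after* the lock is gone.
    {
      std::lock_guard<std::mutex> hold(lock_);
      // Only data manipulation happens under the lock: move the tasks we plan
      // to drop into the local container. (The real code would do this
      // conditionally, based on each task's shutdown behavior; we drop
      // everything here for brevity.) splice() runs no destructors.
      doomed.splice(doomed.end(), pending_tasks_);
    }
    // Lock released; the tasks (and whatever their destructors do) die here.
  }

 private:
  std::mutex lock_;  // Protects pending_tasks_ (the lock guards data, not code).
  std::list<PendingTask> pending_tasks_;
};
```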
No new patch, but I tried to answer your bigger questions.

http://codereview.chromium.org/8416019/diff/3002/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:155: // Created lazily when terminated_ is set and there are pending tasks, this
On 2011/10/29 02:37:14, jar wrote:
> Why is this lazily created?
We only need it when there's a blocking task running when shutdown is called. No need to slow down startup creating a lock that probably won't be used, and if so, only at shutdown.

content/common/sequenced_worker_pool.cc:200: // for shutdown whenever the termianted_ flag is set.
On 2011/10/29 02:37:14, jar wrote:
> Hmm... this suggests the semantics are to sacrifice order so that you get
> SHUTDOWN_BLOCK tasks to run.
If there are no free threads, then we do block on any tasks to wait for a free thread.

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:66: SKIP_ON_SHUTDOWN,
Yes. SKIP_ON_SHUTDOWN is the default for the current file thread (I think we empty the task queue and join with the thread, but don't quote me on that). This is what I think of as the default. I'm worried about the "we'll forget about you on shutdown" behavior because I think it will be too easy to write crashing code. Obviously BLOCK has other disadvantages.

content/common/sequenced_worker_pool.h:70: // sparingly.
Interesting point, I missed that. I'll move the shutdown mode to the token for the sequenced case.

content/common/sequenced_worker_pool.h:77: // fail (return false) and the task will be deleted.
I didn't know how else to define this. I wanted a way to write files out, like the visited link file, where you post tasks to the FILE thread to write little bits to it as stuff happens (in this example, you probably won't get updates during shutdown, but anyway...). I want these updates to get written if the user quits, since links will render incorrectly for subsequent runs if they get dropped. If I do a task during shutdown, what happens? Today I do BrowserThread::PostTask(BrowserThread::FILE, ...) which then drops the task if we already joined with the file thread. From reading the comments on those functions, it looks like they even have a third mode where Quit is already posted to the thread but the thread still exists, in which case the task is dropped but PostTask still returns true. So with this model, at least you know what happens (the return value tells you whether it's guaranteed to run or not). The only other option is to just run the task right there before returning, which seems like a bad idea to me.

content/common/sequenced_worker_pool.h:93: int id_;
This isn't changed; it's the value returned to the caller that never changes (I should probably make this const). The thing that's incremented is in the Inner class.

content/common/sequenced_worker_pool.h:127: // shutdown regardless of the specified ShutdownBehavior.
See discussion above, I don't think we can do better.
I'm thinking of this as "replace the FILE thread with something more flexible" rather than "write the ultimate thread pool." As such, it has a fixed max number of threads which is passed in to the constructor. It creates them as needed. If there are no threads available, tasks wait in a queue for one to become available.

The max threads is passed to the constructor. I was going to hook this up to BrowserThread or something which would create it with an appropriate value. I was expecting the max number of threads to be relatively low (like 5-10) since in general the file thread seems mostly OK with just one thread. As a result, I didn't see much need for thread cleanup. I think this should answer a lot of your questions (like unbounded thread numbers, blocking in the waiting queue, etc.).

A more normal thread pool behavior would be to create threads up to some ridiculously high number for as long as there's stuff to run on them (Windows' default max is like 500) and then aggressively delete them. This didn't seem necessary to me for the task at hand, has slightly more complexity, and I worry about system starvation.

Regarding locks:
To add some color to what I was talking about in the previous message... For this case, we shouldn't really have that many tasks to perform in parallel. And if we do, my intuition says I don't want to have 20 threads just because we have 20 tasks. I also start to worry about overall system CPU or I/O consumption if we're doing that much in parallel. This is a bit different than DNS lookups for the network stack which uses a regular thread pool, where you kind of want more threads for each one because you know you're just going to be waiting. Maybe my intuition on this is wrong, however.
Those comments provide a lot more motivation. To summarize: you're not looking at replacing the current worker pool; you're only looking at creating something to help with the FILE thread pile-up. I'm pretty convinced that for such a case, a mere 2-4 servers (threads) are all you'll need to nicely handle the logjams. In any case, the task profiler I have will show what queueing delays are like for such a setup. With the current single-threaded FILE thread, task durations of 500-1500ms are common on my machine. The resulting queueing can also be seen to delay numerous tasks for most of that time (at startup). Certainly anything in the above direction (multiple servicing threads) is an improvement.

I had previously envisioned that several distinct threads would be combined into such a pool, but I see that is not your plan. My expectation was that a bunch of rarely used threads could all share such a pool, and each "user" would have one token. That picture has long-lived tokens. In contrast, the picture I now think you're going for has short-lived tokens. Perchance they are so short lived that when a single thread posts 3 tasks in a row, it might request/acquire a new token for that trio of tasks. Is that the plan?

The big reason for stopping and starting threads is (I think) memory costs. A thread carries a stack with it, and that stack may be large. Long-lived threads create a stack as deep as their largest task requires... and keep it forever. Short-lived threads end up using a much smaller (on average) allocated stack depth (assuming it is committed on demand). In fact, the common limit on threads in Windows has to do with 32-bit virtual address space vs. the cost of a stack per thread.

I'll have to go back and re-read your code with this perspective (replacing the FILE thread only). I do still believe (even with a tiny number of servers) that the code simplicity and correctness would be much better building atop condition variables.

Jim
I started writing a version with condition vars and it didn't seem like it would be much better.

How it works now: I create a base::Thread and PostTask the tasks to the MessageLoop of that thread. This is a bit more complicated than "normal" because I only PostTask when I know that thread isn't doing anything. There is a bunch of bookkeeping info that needs to be updated from within the lock. I didn't really even think of condition vars since normally we do everything with PostTask and locks if necessary. I think there is something to be said for the familiarity that these constructs offer to average Chrome engineers, as long as it isn't at the expense of complexity. It's possible I'm misunderstanding the condition var example and how it would simplify the code.

Here's what I was thinking: for the workers, instead of Thread I would use base::SimpleThread, which would do the condition var while() loop. If we wanted to be able to spin down threads like you were talking about at one point, I'd need to add more code in there to handle that case, although I think for the initial implementation we can just "leak" the threads. But I think I'd still need the lock for bookkeeping, since a lot of tracking is required for all the different tasks. The cond var would likely allow me to eliminate the idle threads queue, but I think most everything else would stay. It seems about the same complexity to me. Am I missing something that cond vars allow me to make simpler?
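For reference, a minimal standalone sketch of the "condition var while() loop" pattern being debated, written with std::condition_variable rather than Chromium's base::ConditionVariable, and with an invented class name (PoolSketch). This is the generic pattern only, not the sequencing/token logic of the patch: each worker waits on the condition variable, pulls one task off the queue while holding the lock, and runs it with the lock released.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical pool illustrating the condition-variable worker loop.
class PoolSketch {
 public:
  explicit PoolSketch(size_t num_threads) {
    for (size_t i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  ~PoolSketch() {
    {
      std::lock_guard<std::mutex> hold(lock_);
      shutting_down_ = true;
    }
    cond_var_.notify_all();
    for (auto& t : workers_) t.join();
  }

  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> hold(lock_);
      tasks_.push_back(std::move(task));
    }
    cond_var_.notify_one();  // Wake one idle worker to re-check the queue.
  }

 private:
  void WorkerLoop() {
    std::unique_lock<std::mutex> hold(lock_);
    for (;;) {
      // The condition is checked under the lock; wait() releases it while asleep.
      while (tasks_.empty() && !shutting_down_)
        cond_var_.wait(hold);
      if (tasks_.empty() && shutting_down_)
        return;  // Drained; exit (unique_lock unlocks on destruction).
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      hold.unlock();  // Never run arbitrary task code while holding the lock.
      task();
      hold.lock();
    }
  }

  std::mutex lock_;  // Protects tasks_ and shutting_down_ (data, not code).
  std::condition_variable cond_var_;
  std::deque<std::function<void()>> tasks_;
  bool shutting_down_ = false;
  std::vector<std::thread> workers_;
};
```

The point of contention in the thread is visible in the loop: the central lock holds only a queue and a flag, and the workers make all scheduling decisions themselves rather than having a controlling thread post to per-worker message loops.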
Drive-by: it'd be nice to have this class defined in the 'base' library; I'd like to be able to use this from within various src/webkit/... libraries outside of 'content'.
A high-level API fear is that we really need an enumerated list of pseudo-threads, rather than a dynamic token generator. IMO, too many sequences of tasks are sent within Chrome at various levels to a given thread, with the presumption of "sequencing." I don't think it is a "local thing" relating only to a function sending several tasks. I think the tasks are hierarchical, wherein, for example, multiple parent tasks are posted, and each parent task sends numerous child tasks. We expect the sequence of child tasks to be a very special traversal order, and I don't think we can emulate that easily with local sequence numbers. In this regard, I think your design is too general.

I'm still left feeling that we're pounding in a nail (task) with a jack-hammer (message loop), but I'm not clear how much perf difference will be involved (compared with condition variables), and how much we care (given we might have larger tasks on this file-like thread). I pointed out some of the visible issues in comments. Since I wasn't clear enough about the issue of holding locks in functions vs. locking data, I gave a number of explicit examples of where the technique is used in this CL. I also gave some examples of how the code could be made cleaner (relative to my targeted locking philosophy).

http://codereview.chromium.org/8416019/diff/7001/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:41: // This should only be called from within Inner's lock.
It is surprising to require a lock (when posting a task).

content/common/sequenced_worker_pool.cc:46: thread_.message_loop()->PostTask(sequenced_info.location, task);
It is more surprising to see a local lock held as we post to a message loop, which acquires yet another lock. I expect this is not necessary... but need to think more about it. Preliminary thoughts are that we're using a powerful tool, a message loop, to do some very minimal work: run one task at a time with no pending queue. It is kindred to using the super fancy (and heavyweight) thread-safe observer when all that is needed is a permanently registered callback function. The net-net is that rather than the cool efficiency of a message loop, which asymptotically needs just one lock acquisition (remarkably), and no signaling, to service a task, we now seem to require 3 lock acquisitions, and 1 signal, for every task. We no longer gain any efficiency under load. The uber lock that is held has a lot of potential for contention, as it is held for a long time, and held while acquiring other locks (which can then block it). I tried to think about the condition variable implementation. It would require (under load) asymptotically two lock acquisitions, and no signals. You could argue that the performance of this class is not critical, as we often run slow tasks in it (which will massively dominate any wimpy locking and signalling issue). On the other hand, I see that many tasks on the file thread are indeed fast... and I'd expect that more tasks would be tossed over here if we had multiple thread servers.

content/common/sequenced_worker_pool.cc:76: class SequencedWorkerPool::Inner
Suggest you add a comment to clarify that this is a singleton (if indeed it is).

content/common/sequenced_worker_pool.cc:133: std::deque<Worker*> idle_threads_;
You should use a stack, as I don't think there is a good reason to kill these threads (and certainly you're not yet doing such to justify a deque).

content/common/sequenced_worker_pool.cc:168: SequencedWorkerPool::Inner::GetSequenceToken() {
Cool as this is... I'm suspicious that we need an API that uses an enumerated list of tokens, rather than ad-hoc ones. I *think* that it will be less common to be able to post two or three tasks with a sequence token, and have that suffice. I expect that the task that is running (that posts additional tasks) will commonly be part of some outer sequenced stream of tasks, and all children will together need to be sequenced. This is accomplished today with our enumerated list of distinct threads. I think we should stay with that enumeration model, and just have the freedom to assign a sequence (enum) to any worker thread at any time (even though this will wreak a bit of havoc with the profiler... but that should be my concern... and not an issue for your design).

content/common/sequenced_worker_pool.cc:196: void SequencedWorkerPool::Inner::Shutdown() {
This function is written as a blocking call. I much prefer asynchronous calls, especially since we're living in a message loop world (where we shouldn't block a loop). What thread are you anticipating will run this call (and block)? I'm a little wary of a deadlock here as well as performance issues.

content/common/sequenced_worker_pool.cc:211: i = pending_tasks_.erase(i);
Does this potentially induce a task destructor? Do we have assurances on how long such things take? I've often wondered if a task should not only have a Run() method, but also a DiscardedWithoutRun() method that should presumably do MUCH less work. My philosophy for locks suggests that we should move all the deletes outside the lock. You can easily do this here. You should put all the tasks you're thinking of erasing into a separate container, and then delete them after you release the lock (you might have a self-deleting container at function scope, outside the locked scope, for this purpose, and then you can just do the return statements on line 219).

content/common/sequenced_worker_pool.cc:246: void SequencedWorkerPool::Inner::ScheduleWork() {
You might consider adding a comment that this function will schedule, at most, one new task, and possibly fewer if there is nothing to do. I'd suggest returning a bool to indicate if you succeeded in scheduling work.

content/common/sequenced_worker_pool.cc:279: // next one from that token right away.
That is a good optimization for efficiency... but it destroys your fairness promise. For example, if there are N worker threads, then once we have N distinct sequence tokens running, they can hog all thread usage until they run out of tasks :-(. Methinks that may be what you're saying in the next paragraph.

content/common/sequenced_worker_pool.cc:296: Worker* worker;
Should this be a linked_ptr?

content/common/sequenced_worker_pool.cc:313: worker->PostTask(*i, base::Bind(&Inner::RunTask, this, worker, *i));
Now that we have the worker thread out of the idle_threads_ stack, there is no rush to post this task (while holding the lock). The current structure makes this a little difficult (since the whole function scope is locked on entry), but we could (at line 316) AutoUnlock the lock briefly to be cleaner about the construction of the Bind() and the posting of the task (some refactoring would probably be yet cleaner... such as returning a runnable task... but there are probably better/cleaner ways as well).

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:59: // it's best to use this only for slow but simple operations like the DNS
DNS might not be a good example. DNS probably needs to run on full-blown worker threads. They have the potential for really hanging around... and there are often a lot of them (at least 8, and more with retry logic) in parallel.

content/common/sequenced_worker_pool.h:65: // will be completed.
What is the alternative to "will be completed?" There is no nice way to kill a thread that is uncooperative. When you say "will be completed," are you noting that the thread this is run on will be joined, and hence process shutdown will wait for it? How else would you assert "will be completed?" I'm confused. I suspect you can only talk about run semantics up to the point you run or discard. Once you start to run, you can't abort it, but you can delay process termination to wait for it.

content/common/sequenced_worker_pool.h:75: // If a task is posted during shutdown, it will not get run since the
Just to be clear... I think your rule is: "If this is successfully posted, this task will be run to completion."

content/common/sequenced_worker_pool.h:131: const base::Closure& task);
I'm still feeling the arg order is wrong. I think it should be from_here, then Closure, then shutdown behavior, then sequence token. That would make it look more like PostTask and PostDelayedTask, where the delay element is last (optional), and from_here is always first.

File content/common/sequenced_worker_pool_unittest.cc (right):

content/common/sequenced_worker_pool_unittest.cc:16: // tests should not be flaky, and hangling indicates a type of failure. Do not
nit: typo hangling -> hanging
On 2011/11/23 20:08:16, jar wrote:
> A high-level API fear is that we really need an enumerated list of
> pseudo-threads, rather than a dynamic token generator. IMO, too many sequences
> of tasks are sent within Chrome at various levels to a given thread, with the
> presumption of "sequencing." I don't think it is a "local thing" relating only
> to a function sending several tasks. I think the tasks are hierarchical,
> wherein for example, multiple parent tasks are posted, and each parent task
> sends numerous child tasks. We expect the sequence of child tasks to be a
> very special traversal order, and I don't think we can emulate that easily with
> local sequence numbers. In this regard, I think your design is too general.

Can you clarify what I should do in response to this comment?
http://codereview.chromium.org/8416019/diff/7001/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:168: SequencedWorkerPool::Inner::GetSequenceToken() {
I think an enum destroys the whole point of this patch. In a previous comment michaeln requested this be in base since he wants to use it in src/webkit. I'm not going to move it right away, but how could this be general enough to maintain a global list of enums for the browser while living inside of base? What would the enum look like if we had it? What possible meaning would we assign to different threads that people could understand?

Is there an example of a multilevel sequencing problem using the FILE thread that you're worried about? The vast majority of the tasks posted to the file thread are independent. There are a handful of cases, like the visitedlink master, that post updates to the same file, and that's what this class is attempting to solve.

content/common/sequenced_worker_pool.cc:196: void SequencedWorkerPool::Inner::Shutdown() {
This has to be the main thread. It has to block since some tasks have the semantics of "block shutdown until I complete".
I think the enum list is the natural way to go. Your concern in a follow-up email is how to put this in base and still have an arbitrary list? To handle that, what you'd need is a more dynamic (run-time) enumeration. I'd suggest that a user can call using a string, and that you hash the string into an int which is a sequence number. The interface should be set up to make that work easily (i.e., probably the call to post a task takes a string name). The hash could be deterministic (MD5), or simple and collision free (we assign atomic names to each string used). Such an interface would simulate an arbitrary number of threads (and we can be adaptive about how many we make!), but would have the ability to dynamically assign things to a single thread.

I do like the idea of being able to say "just run it" with no serialization (you used id == 0 in your CL). I can believe that can be deduced locally (knowing only the task, and having no care about the cause). If you went with strings, perhaps "" would be the "just run it with no serialization" mantra.

Jim

On Wed, Nov 23, 2011 at 12:43 PM, <brettw@chromium.org> wrote:
> Can you clarify what I should do in response to this comment?
On Wed, Nov 23, 2011 at 1:01 PM, <brettw@chromium.org> wrote:
> Is there an example of a multilevel sequencing problem using the FILE
> thread that you're worried about?

The problem is that I'm not sure how to migrate code that used to post to the file thread. I'd much rather (when evolving sync code, for example) post to the "SyncFile" thread where I used to post to the "FileThread." I think I see a faster path to migrate code. The granularity is much coarser than your proposal with GUIDs being generated on demand, but I don't see conflicts in the profiler that call for such extreme measures. The logic under this enum would remain the same... I just don't see

Jim
On Wed, Nov 23, 2011 at 2:13 PM, Jim Roskind <jar@chromium.org> wrote:
> On Wed, Nov 23, 2011 at 1:01 PM, <brettw@chromium.org> wrote:
> > Is there an example of a multilevel sequencing problem using the FILE
> > thread that you're worried about?
>
> The problem is that I'm not sure how to migrate code that used to post to
> the file thread. I'd much rather (when evolving sync code, for example)
> post to the "SyncFile" thread where I used to post to the "FileThread." I
> think I see a faster path to migrate code. The granularity is much coarser
> than your proposal with GUIDs being generated on demand, but I don't see
> conflicts in the profiler that call for such extreme measures.

Okay, your suggestion makes more sense to me with this example. I was envisioning that in these cases some kind of "main class" would provide the sequence token to interested parties to use. We do this for property bags (e.g. HtmlDialogUI::GetPropertyAccessor()), which seems to work well, but could be awkward in some cases for a very decentralized component with no natural place to put this accessor.

I'll probably just make a map from string name to ID so I can keep using ints internally and the conversion is super simple to understand.

Brett
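A sketch of the name-to-ID mapping Brett describes, with illustrative names (TokenRegistrySketch, GetNamedSequenceToken) and std:: types rather than whatever the patch eventually uses: repeated lookups of the same string return the same integer token, so decentralized callers can agree on a sequence simply by agreeing on a name.

```cpp
#include <map>
#include <mutex>
#include <string>

class TokenRegistrySketch {
 public:
  // Returns the same nonzero token every time the same name is passed in.
  int GetNamedSequenceToken(const std::string& name) {
    std::lock_guard<std::mutex> hold(lock_);
    auto found = named_tokens_.find(name);
    if (found != named_tokens_.end())
      return found->second;
    int token = ++last_token_;  // 0 stays reserved for "unsequenced, just run it".
    named_tokens_[name] = token;
    return token;
  }

 private:
  std::mutex lock_;  // Protects named_tokens_ and last_token_.
  std::map<std::string, int> named_tokens_;
  int last_token_ = 0;
};
```

So code that used to post to a dedicated named thread could instead post with, say, GetNamedSequenceToken("SyncFile"), and every caller using that string lands in the same sequence without a central place handing out tokens.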
Michael: I'm going to leave this in content for now. We can move it later. I'm not entirely sure how you would use it in WebKit, however. It's not a singleton; somebody needs to configure the number of threads it has and own it. I was going to have BrowserThread do this, so it wouldn't be available to the webkit directory anyway. You could have a separate one for WebKit, but that sounds wasteful. We'll have to figure out how this works. Conceivably, we could have one in base and have content / the browser configure it.
I uploaded a new patch using condition variables. The code is longer. Currently it deadlocks and I've yet to figure out why. I have to do extra work when posting a task to see if adding another thread can legitimately do work. I can get around this brute-force search of all tasks (line 240) by adding more tracking information, which will make it even more complicated. I don't think this is an improvement. My previous code never had any deadlock bugs and was easier for me to debug. I did not have to review how condition variables work, and people reading and maintaining the code wouldn't have to either (everybody understands PostTask and locks). Figuring out whether adding another thread would be useful was easier since tasks were explicitly scheduled rather than "consumed" (avoiding the race I explained in the comment). I think we should use the previous version.
Thanks for looking at the alternative. As the comments below show, you didn't fully drink the kool-aid for condition variables and threading. You have to trust the threads to do more work, and make autonomous decisions. You need very little state in the central locked singleton. You should generally have only a few counters, plus the set of active tokens. Deadlocks are very hard to induce with condition variables, as there is very little done under a lock. I suspect you got tangled up with the waiting event that you had from your previous implementation. You should (as noted below) wait for shutdown by waiting on the CV. I think that you're also going to reap rewards in shutdown performance, as (per comments below) you get to do cleanup of discarded events on worker threads (instead of on "a main controlling thread"). You'll also get the cleanup done while you're waiting for the last critical worker threads to complete.

http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:100: bool RunnableTasksAreOverSubscribed() const;
I don't think you'll need this function.

content/common/sequenced_worker_pool.cc:133: std::set<Worker*> running_threads_;
I'd expect each thread to hold its own status. At best, I'd expect counts to be maintained in this controlling singleton.

content/common/sequenced_worker_pool.cc:158: scoped_ptr<base::WaitableEvent> shutdown_complete_;
The condition variable would be the natural thing to wait for, even on a non-worker thread.

content/common/sequenced_worker_pool.cc:246: // of task and free thread counts would indicate that a new thread should be
I think what you left out of this argument is that the thread that was just started to handle the unused sequence token will be entrusted to create another thread if need be. So long as you know there are threads waiting for your signal, you should let them handle the work of starting a (single) new thread if need be.

content/common/sequenced_worker_pool.cc:250: // Therefore, we do a more thorough brute-force check of runnable tasks to
I think we should just maintain a few counters, and let them form the "conditions" that are checked, and there would be no new race, or complications. When someone posts a task, if we have more threads running than we have sequence tokens in the active set, then everything will sort itself out (no need to figure out if you should start yet another thread or not... the extra running thread will do that soon enough). The only time to start a new thread is when all threads are busy, and the new token is not in the active set. There is then a race of sorts, as an active thread can always race to finish against the startup of the new thread, but that is inevitable with any implementation. No matter who starts a new thread, it is possible that instantly we'll have all the other threads finish their work (they can even be waiting for the lock as you're deciding!), leaving us wondering why we started a new thread. The difference is that with condition variables you notice this race (the logic is simpler, and you can see it), whereas with the previous implementation you started a message loop and task, and ignored the startup time (which was actually much more extensive). Bottom line: a race can always cause us to create one more thread than we "could have gotten away with."

content/common/sequenced_worker_pool.cc:258: // it's not necessarily fast and it could block other work), but avoiding
All you do inside the lock is bump the counter on the number of started threads. You then exit the lock, and start the thread.

content/common/sequenced_worker_pool.cc:260: threads_.push_back(linked_ptr<Worker>(new Worker(this, threads_.size())));
I don't think you need to maintain a list of threads. I think they should be self-owning. As a result, I don't think you need this push inside the lock. I think you need a count of the number of threads that need to be waited for at shutdown. That is what any caller that wants to effectively join the threads should check for as a "condition" when they are waiting for those urgent threads to complete.

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
This cleanup action should happen on a worker thread. This will also keep such cleanup off the critical path of the thread that called shutdown, unless you want to assert that cleanup (deletion of discarded tasks) must be waited for. If so, then worker threads should each take it upon themselves to delete one discarded task at a time... and they'd all run in parallel.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
Instead of using a waitable event, you should wait on the condition variable, testing to see that cleanup and runs are complete.
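A sketch of the shutdown wait jar is suggesting in place of the WaitableEvent, with invented names and std:: primitives (the real code would use base::Lock/base::ConditionVariable): the caller of Shutdown() waits on the same condition variable the workers use, and the "condition" it tests is just a couple of counters.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative fragment: assume these members live in the pool's Inner class
// and are only touched under lock_.
struct ShutdownStateSketch {
  std::mutex lock_;
  std::condition_variable cond_var_;
  int pending_blocking_tasks_ = 0;   // Queued BLOCK_SHUTDOWN-style tasks.
  int running_blocking_tasks_ = 0;   // BLOCK_SHUTDOWN-style tasks currently executing.
  bool shutdown_called_ = false;

  void Shutdown() {
    std::unique_lock<std::mutex> hold(lock_);
    shutdown_called_ = true;
    // Wait on the condition (a few counters), not on a separate event object.
    while (pending_blocking_tasks_ > 0 || running_blocking_tasks_ > 0)
      cond_var_.wait(hold);
  }

  // Workers call this whenever they finish a blocking task; the notify can
  // (and should) happen after the lock is released.
  void OnBlockingTaskFinished() {
    {
      std::lock_guard<std::mutex> hold(lock_);
      --running_blocking_tasks_;
    }
    cond_var_.notify_all();  // Wakes both idle workers and a blocked Shutdown().
  }
};
```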
http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:336: task.task.Run();
One other thing... it is common to signal after you take on work, to have some other worker check to see if there is more work. You could spend the time in this thread seeing if you "need help" servicing everything, or you could just signal, and let some other thread check (in parallel) to see if there is more work to do. That missing line might be a part of your deadlock also.

Since you did go with a lazy start of threads, you'll also need a boolean to indicate if you should spawn another worker just before you start to do your "real" work (the call of Run()). The entity that decides to spawn a new thread should bump up the count on active threads (inside the lock), so there is no race or confusion about the count of active threads.
http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
How can you guarantee that there is a worker thread available to run the discarded event? They may all be running CONTINUE_ON_SHUTDOWN tasks and will never run more code from us again.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
You want me to wait on the same condition variable I use for the workers? That seems super confusing to me.

content/common/sequenced_worker_pool.cc:336: task.task.Run();
On 2011/11/26 01:25:59, jar wrote:
> One other thing... it is common to signal after you take on work, to have some
> other worker check to see if there is more work. You could spend the time in
> this thread seeing if you "need help" servicing everything, or you could just
> signal, and let some other thread check (in parallel) to see if there is more
> work to do.
>
> That missing line might be a part of your deadlock also.
You're right, that's what was causing my problem. I think this adds to my argument that condition variables are more difficult to understand than the normal Chrome constructs.

> Since you did go with a lazy start of threads, you'll also need a boolean to
> indicate if you should spawn another worker just before you start to do your
> "real" work (the call of Run()).
I'm going to check the queue every time I take something out. It's difficult to figure this out ahead of time since there may be multiple threads required.

> The entity that decides to spawn a new thread should bump up the count on active
> threads (inside the lock), so there is no race or confusion about the count of
> active threads.
I continue to think this is the perfect job for CVs. I think this sort of application is probably the most typical example of what CVs are used for (though you've added some interesting semantic twists for the serialization). I also see your current trip towards CVs as being valuable, and illuminating a number of less clear elements of the spec (state machines tend to be very forthcoming about covering all possible transitions). I have the sense that you're really opposed to these... and really think the message loop hammer is a better approach... but since you're writing the code, go with what you feel comfortable with, and I'll comment in that domain.

http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
Interesting question. I hadn't thought about that as a requirement. It is a great question. Do you want the semantics to guarantee destruction of discarded tasks? I'm surprised to see you suggest that you're willing to wait for them to get destroyed, but yet you're not waiting for them to complete a run :-/. IMO, given that you're willing to leave some non-critical tasks running, it is no big deal to leave some tasks undeleted if you get "too busy." I'd expect that more typically, the non-critical tasks would finish, and you'd soon be down to a few critical tasks, and a bunch of idle threads... just waiting to do your deletions.

Your question is very similar to the question of what to do if you have all your threads "stuck" running non-critical tasks, but you need to run critical tasks at shutdown. What do you think the right thing is to do? Do you want the semantics to be that the "must run" tasks will be run only on a restricted set of worker threads? Are you willing to sacrifice the thread count limit? ...or is that firm, and you'll wait indefinitely for a time-slot on a busy worker thread? ...or do you want to try to run the "must run" tasks on the thread that calls for a shutdown, since it was going to wait indefinitely anyway... hmm. My guess is that this is a great question... and we should settle on the semantics. WDYT? Hang?

At the heart of this, the current official release has a shutdown problem that revolves around a non-critical(?) task hanging, and blocking all availability of "the thread pool" (a single FILE thread) from running a critical task. I think that during shutdown, you have to make some extra threads available to ensure that you can service the critical tasks. I suspect that otherwise, we'll sometimes get a bunch of hung threads, and have no thread power to handle the shutdown. This "reserve tank" approach is commonly used in memory allocators to facilitate more controlled crashes when we're "out of memory." Maybe we need a reserve number of threads for such emergencies.

On 2011/11/26 01:54:40, brettw wrote:
> How can you guarantee that there is a worker thread available to run the
> discarded event? They may all be running CONTINUE_ON_SHUTDOWN tasks and will
> never run more code from us again.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
On 2011/11/26 01:54:40, brettw wrote:
> You want me to wait on the same condition variable I use for the workers? That
> seems super confusing to me.
Clearly the goal is to wait until the system achieves some "state" (all required threads have run). It seems really natural to wait for such an event. CVs are really good about avoiding the race between checking state and waiting. CVs can be built atop waitable events, and vice versa.

content/common/sequenced_worker_pool.cc:336: task.task.Run();
On 2011/11/26 01:54:40, brettw wrote:
> I'm going to check the queue every time I take something out. It's difficult to
> figure this out ahead of time since there may be multiple threads required.
The trick is to just move incrementally in the right direction. You don't need to start a lot of threads. At most, one more... and that thread, when it starts, may choose to start another. IMO, the simple approach is to always have one extra on hand. Any time you run a task, if you think you are the last idle thread, and you're below the limit, you should start another. As I said, you should try to make it simple, and just look at the few state variables (the conditions) to see what you should do. If you want to be fancy, when you call the function that finds the next task that is eligible to run, that function can return a bool indicating if there are other tasks that could have been run as well. That would give you the minimal thread creation.
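A sketch of the incremental growth rule described just above ("if you think you are the last idle thread and you're below the limit, start another"). All names here are made up for illustration; the decision and the counter bump happen under the lock, while the slow thread creation happens outside it.

```cpp
#include <mutex>

// Illustrative sketch; names are invented, not from the patch.
struct GrowthSketch {
  std::mutex lock_;
  int idle_thread_count_ = 1;
  int thread_count_ = 1;
  int max_threads_ = 4;

  void StartOneWorkerThread() { /* create and start a worker (omitted) */ }

  // Called by a worker just before it dequeues a task and starts running it.
  void MaybeStartExtraWorker() {
    bool should_spawn = false;
    {
      std::lock_guard<std::mutex> hold(lock_);
      // If this worker is the last idle one and we are below the limit,
      // commit to exactly one extra thread by bumping the count under the lock.
      if (idle_thread_count_ == 1 && thread_count_ < max_threads_) {
        ++thread_count_;
        should_spawn = true;
      }
    }
    if (should_spawn)
      StartOneWorkerThread();  // Thread creation is slow; do it outside the lock.
  }
};
```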
I thought about it more, and think the deletion question for un-run tasks may be even more significant. I *think* that it may be problematic to delete pending tasks before corresponding (sequenced) tasks have completed, even if the already-running tasks were marked as non-critical. The problem is that the would-be-subsequent tasks might delete things that are being processed by a current task. The (closest) current example would be a task labeled as delete-soon. It is possible to send a task to do something, and then send a sequentially subsequent task that deletes an object. Depending on how that deletion task is crafted, the mere act of deleting the task might achieve the deletion of the object in question.

Bottom line: I think that IF you're going to delete tasks that have not been run, you should delete them in a sequenced fashion. I *think* we sequence the destructions today on our named threads, after we've stopped running tasks. If you buy that approach, then the "right" place to delete tasks is probably on the worker threads (though you could do it on the shutdown request thread). The easy(?) way to do the sequential deletion on a worker thread is to bind a task into a deletion closure (whose execution will drop the ref), and then run that closure (in lieu of the actual task). In any case, it was a great question to ask about the semantics of destruction of the un-run tasks.
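A small standalone sketch of the "bind the task into a deletion closure" idea (Task and MakeDiscardClosure are illustrative names, not the patch's types): instead of running the task, the worker runs a wrapper whose only effect is to drop the task, so its destruction still happens at its place in the sequence, on a worker thread.

```cpp
#include <functional>
#include <memory>

// Stand-in for a posted closure whose destructor may itself delete objects
// (e.g. a DeleteSoon-style task).
struct Task {
  std::function<void()> run;
};

// Returns a closure that, when executed in the task's normal place in the
// sequence, destroys the task without running it.
std::function<void()> MakeDiscardClosure(std::unique_ptr<Task> task) {
  // shared_ptr because std::function requires a copyable callable; running the
  // wrapper drops this closure's reference, destroying the un-run task then.
  std::shared_ptr<Task> doomed(std::move(task));
  return [doomed]() mutable {
    doomed.reset();  // Destroy the discarded task here, in sequence order.
  };
}
```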
Question: If you use a message loop based implementation, how would you correctly(?) handle having a sequenced-task (try to) post another task to the "current" message loop? Wouldn't that bypass your efforts at fairness? You could ask folks to "not do that," but migration of existing code would be complicated, and it would be hard to spot this routine action as illegal. Maybe you'd use a message loop which refuses to reveal itself as current? Points at the sequencing queue instead?
I agree that the cond vars are the correct CS way to solve this problem, and they will be faster than the message loops. Now that I have it working, I don't think we should go back. But it doesn't seem clearly better (some parts are nicer, some parts are more subtle, it's still more LOC) and I don't think the time investment was worthwhile to change it.

This patch is a nice improvement on the previous cond var patch set. I like the spawning of threads from the background threads in more cases, but this is one of the subtle cases where it's difficult to see how it works just from reading the code. I had to add more exit code handling, and removed some of the thread tracking you pointed out.

I did not change the shutdown event. The current shutdown semantics, which I'm happy with, are that we just run critical tasks as fast as possible on shutdown, blocking on already-running non-critical ones if necessary. I'm assuming tasks won't hang indefinitely and we'll eventually run them. The semantics are the same as for a MessageLoop thread except that some tasks won't get run. I don't foresee this coming up much and I don't think it's worth adding some elaborate code to handle it.
> Bottom line: I think that IF you're going to delete tasks that have not been > run, you should delete them in a sequenced fashion. I think I need to delete the tasks to keep the memory bots happy. Personally, I'd rather drop them on the floor. I agree that the tasks should be deleted in sequence. This is a pretty far out edge case, but if we ever hit it the crash will be impossible to debug. I'll add that.
I see the problem now. We can't delete tasks in sequence without extra threads. I'll have to think about that more...
New snap up. This does a best effort to delete tasks while keeping everything sequenced. I think continue on shutdown tasks will be rare and this probably won't come up enough to make the memory bot significantly flaky.
On 2011/11/25 18:33:38, brettw wrote:
> Michael: I'm going to leave this in content for now. We can move it later. I'm
> not entirely sure how you would use it in WebKit, however. It's not a singleton,
> somebody needs to configure the number of threads it has and own it. I was going
> to have BrowserThread do this, so it wouldn't be available to the webkit
> directory anyway.
Take a look at your .h and .cc files. There are no dependencies on content (content_export.h doesn't count), so why put this class in the content library?

> You could have a separate one for WebKit, but that sounds wasteful. We'll have
> to figure out how this works. Conceivably, we could have one in base and have
> content / the browser configure it.
There's no good reason to have a separate instance that I can see. Let's take the time to think about it before the initial commit. Here's one approach...
- the SequencedWorkerPool class is defined and implemented in the base library
- somewhere in chrome/content an instance is created and configured
- somewhere in test_shell/chromiumDrt an instance is created and configured
- references to that instance are passed to src/webkit object ctors
(maybe the SequencedWorkerPool s/b refcounted and have an explicit Shutdown method)

I think any approach would be predicated on the SequencedWorkerPool class being defined in the base library, so let's do that much from the get-go.
On 2011/11/28 21:30:51, michaeln wrote: > On 2011/11/25 18:33:38, brettw wrote: > > Michael: I'm going to leave this in content for now. We can move it later. I'm > > not entirely sure how you would use it in WebKit, however. It's not a > singleton, > > somebody needs to configure the number of threads it has and own it. I was > going > > to have BrowserThread do this, so it wouldn't be available to the webkit > > directory anyway. > > Take a look at your .h and .cc files. There are no dependencies on content > (content_export.h doesn't count), so why put this class in the content library? This isn't used by anybody yet. It's intended to be a replacement for the FILE thread, and the FILE thread lives in content. How do you do today what you want to do with this thread pool?
> How do you do today what you want to do with this thread pool? MessageLoop and MessageLoopProxy are defined and implemented in base. Libraries implemented in webkit utilize MessageLoopProxy to schedule tasks on appropriate threads that are created by chrome/content. References to the appropriate loop proxy are provided at ctor time.
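A rough sketch of the injection scheme being discussed, assuming the pool ends up in base, is refcounted, and gets an explicit Shutdown() as michaeln proposes. The class FileStore, the function DoBlockingWrite, and the exact posting signature are made-up illustrations, not code from this patch:

// Sketch only: FileStore/DoBlockingWrite are hypothetical, and the
// PostWorkerTask signature is assumed; the real header may differ.
#include "base/bind.h"
#include "base/location.h"
#include "base/memory/ref_counted.h"
#include "base/threading/sequenced_worker_pool.h"  // assuming the move to base

static void DoBlockingWrite() {
  // Blocking file IO would go here.
}

class FileStore {
 public:
  // The webkit-layer object receives the pool at construction time, the same
  // way MessageLoopProxy references are handed in today; it never creates or
  // owns the worker threads itself.
  explicit FileStore(const scoped_refptr<base::SequencedWorkerPool>& pool)
      : pool_(pool) {}

  void ScheduleWrite() {
    pool_->PostWorkerTask(FROM_HERE, base::Bind(&DoBlockingWrite));
  }

 private:
  scoped_refptr<base::SequencedWorkerPool> pool_;
};

// Embedder side (chrome/content, or test_shell for DRT):
//   scoped_refptr<base::SequencedWorkerPool> pool(
//       new base::SequencedWorkerPool(4));
//   FileStore store(pool);
//   ...
//   pool->Shutdown();  // explicit shutdown before teardown, per the proposal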
Again... if you're hating CVs... and havin' to listen to me yap... please feel free to revert to the heavy-hammer message loop. Comments include a combination of places where the code was longer than need be, and some general comments. You're very close to a clean CV implementation IMO. I'm betting it will be smaller than the message loop variant (if you got the message loop variant to handle as much stuff). I'm ok with the apparent plan to discard (not delete) tasks that are not run, but you should document that in the headers. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; You don't need to have a list of threads. All you really need is: active_threads_count_ The current threads_.push() becomes an increment threads_.size() is just active_threads_count_ ...and your join() is just a wait on the CV for the count to reach 0 (which will happen slightly before you'd hear about the thread fully finishing via a join()). You just have to decrement the counter when you decide to exit your main run loop (at line 302... and you should signal()). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:187: cond_var_.Broadcast(); This works... but assuming each thread will, if it finds any work to do, signal for others to check out the state (condition), you don't really need a broadcast(), and a mere signal() will do. In general, broadcast is (I think) rarely used. All the threads would wake up, only to find that they couldn't access the lock, and then they'd go to sleep waiting :-/. Signal() tends to nicely sequence the tasks as locks are released. Broadcast is safe... and maybe this is NBD, since perf is not that critical... but I think signal() is mostly a better pattern to stay with. Also note: Signal() and Broadcast() can both be done outside of a lock... and usually that is better, since the awakened threads have a chance at the lock being available. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:242: cond_var_.Signal(); nit: Nicer to signal after you release the lock. IMO, as a cleaner alternative, you should just always signal() when you release the lock in the main run loop. That means both just before run(), and when you exit the loop. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:260: // There are no pending or running tasks blocking shutdown, we're done. This is perfect. You've effectively checked the condition (under the lock), and decided that everything is done. That is indeed the clean way to use a CV. If you changed the "if" to a "while" and negated the condition, you could put a CV wait here and be done. ...and you would not have to mess with the waitable event.... which is what you effectively realized in the if statement above. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: i = pending_tasks_.erase(i); So you decided to leak the less important tasks? I thought you were planning on deleting them in sequence on these worker threads. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo...
content/common/sequenced_worker_pool.cc:348: pending_task_count_--; if you add a "continue;" here, you won't need an else clause on line 349, you'll have less indentation, and it will be more readable IMO. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:355: blocking_shutdown_pending_task_count_--; Lines 353-355 are critical accounting lines, and keep everything in sync. Similar accounting is done in lines 347-348. If you refactor this code, then you'll only need to do this once (and have less chance of messing up if you change your accounting). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: ++i; This was harder to read than needed, because this clause is so far away from the if(...) in line 338. IMO, it would be better to negate the condition in line 338, put these two lines immediately after line 338, and then add a "continue;" so you don't need an else clause at all. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:436: for (std::list<SequencedTask>::iterator i = pending_tasks_.begin(); I was hoping/expecting that you wouldn't need to start a thread under a lock. I also noticed that this function, as written, is a little wrong, as it will start threads at shutdown even if the only tasks that are available don't block shutdown. IMO, this iteration code should be replaced by a variant of GetWork(). You should adjust GetWork() to accept a null pointer, and then the boolean return value would tell you if there was a need for a new thread. You'd have to call it just after line 293 (when you adjusted your sets and counts). Once it was called at line 293, it would be simple to assign a result to a boolean so as to create the new thread outside the lock (after you autounlock in line 295). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.h (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:52: // deleted rather than run, and any tasks with this mode running at This needs to be updated. I think you are leaking them now... but I didn't check all cases... you might even be deleting some, and not others. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:70: // will be deleted rather than executed. However, tasks already in progress Semantics change: you don't delete. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:83: // fail (return false) and the task will be deleted. I don't think they should be deleted if we're leaking others, especially if they might break the currently active run() of prior sequenced tasks.
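For what it's worth, here is a minimal self-contained sketch of the "change the if to a while and Wait() on the condition variable" idea from the comment on cc:260 above. The class and member names just mirror this discussion and simplify away the rest of the pool's bookkeeping:

// Sketch only: a condition variable replaces the lazily created WaitableEvent
// for the shutdown wait.
#include "base/synchronization/condition_variable.h"
#include "base/synchronization/lock.h"

class ShutdownWaiter {
 public:
  ShutdownWaiter() : cond_var_(&lock_), blocking_shutdown_task_count_(0) {}

  // Called by Shutdown(): loops until nothing that must block shutdown is
  // still pending or running, re-checking the condition under the lock on
  // every wakeup.
  void WaitForBlockingTasks() {
    base::AutoLock lock(lock_);
    while (blocking_shutdown_task_count_ > 0)
      cond_var_.Wait();
  }

  void OnBlockingTaskPosted() {
    base::AutoLock lock(lock_);
    ++blocking_shutdown_task_count_;
  }

  // Called by a worker when it finishes a task that blocks shutdown.
  void OnBlockingTaskFinished() {
    {
      base::AutoLock lock(lock_);
      --blocking_shutdown_task_count_;
    }
    cond_var_.Signal();  // Signal outside the lock, per the note above.
  }

 private:
  base::Lock lock_;
  base::ConditionVariable cond_var_;
  int blocking_shutdown_task_count_;
};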
New snap up. I'll move this to base but I didn't want to do it now or it would interfere with the diffs. I'll do that right before checking in. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; This won't work. I need the actual SimpleThread* to explicitly call Join on it in the destructor, or it asserts. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:187: cond_var_.Broadcast(); Okay, done. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:260: // There are no pending or running tasks blocking shutdown, we're done. I did this but couldn't write the loop here to keep the semantics of the testing observer, which made the result a bit more complicated than you probably envisioned. I had to tweak the termination condition as well. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: i = pending_tasks_.erase(i); This is the worker thread, but I forgot to do it outside of the lock. Because of the semantics of closures, doing this got a bit more complicated. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:355: blocking_shutdown_pending_task_count_--; It does not seem clear to me that there's an obvious refactor beyond what my new patch does. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:436: for (std::list<SequencedTask>::iterator i = pending_tasks_.begin(); I don't want to re-use GetWork. It's already complicated enough. And my additional code to properly delete tasks outside of the lock would actually make this somewhat tricky to use in this location since this function would also need to handle task deletion, and proxy that out to the calling function. Because I need to keep an accurate accounting of all threads to properly join them in the destructor, it's difficult for me to see a solution that does not create the thread in the lock. It seems any attempt to do that will result in a shutdown race that can miss the thread, which will cause a flaky assert in ~SimpleThread.
Actually, I think I figured out how to create threads outside of the lock. We need another counter (that's updated inside the lock) of threads currently being created. We also wait on this counter for the shutdown event, so we're guaranteed that all threads are created by the time we shut down.
Thnx! > I'll move this to base but I didn't want to do it now or it would interfere with > the diffs. I'll do that right before checking in.
Thread creation now happens outside of the lock. It's kind of subtle, I hope the benefit is worth the complexity.
Most of my comments focus on chasing the following pattern (which you mostly do already). I'll just give the pattern so that you can argue against it if you think it is not right. There is a general pattern that makes this class of code easy to write, and offers very few surprises. The general approach involves almost always signaling when you release the lock. Here is the more detailed formulation:
a) Do any stuff you can before you take the lock.
b) Acquire the lock, and decide what you should do, setting local variables to remind you (when you release the lock) what to do. If you have nothing to do (then other threads have nothing to do!) then go into a wait state.
c) Release the lock (sometimes via AutoUnlock... but this is your choice based on loop structure convenience).
d) Since you had work to do, you can almost routinely signal to see if others can find work to do. This also takes care of the case where you've modified the state (conditions) to get others to do stuff.
e) Do what you planned on doing (based on local flags etc.).
f) Go back to (a) (after cleaning up after what you did).
The above pattern includes shutdown:
b1) You decide you should shut down.
c1) Release the lock (often by "break"ing out of a loop, and generally not via AutoUnlock).
d1) Signal (which gets other folks to shut down).
You've moved to doing this more, but the above is super safe and simple. Holding the lock as briefly as possible (example: avoid doing big work under the lock, and only change states) ensures there is little contention from other threads looking for work to do. It also precludes almost all deadlocks. Similarly, the standard of signal-when-unlocking (other than when you go into a wait state) means you don't need to worry about checking whether other threads might find work: they can just check for themselves. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; Hmm.... this is a confusing requirement. What is the assert concerned with?? I'll look at the code. On 2011/12/01 02:44:57, brettw wrote: > This won't work. I need the actual SimpleThread* to explicitly call Join on it > in the destructor, or it asserts. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:14: #include "base/threading/thread.h" nit: alphabetize http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:54: Inner(size_t max_threads); nit: explicit for constructor http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:126: size_t max_threads_; Is this const from construction time? http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:175: StringPrintf("BrowserWorker%d", thread_number).c_str()), Since this class appears to be designed so that we can have several distinct pools, you should probably have the prefix-name (here shown as "BrowserWorker") as a constructor parameter, which is held in Inner. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo...
content/common/sequenced_worker_pool.cc:256: sequenced.task = task; Although not a big deal, lines 252-256 should, as a matter of standard practice, be performed outside the lock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:334: base::AutoUnlock unlock(lock_); This inner loop looks excellent. I'd suggest doing a signal() just after the unlock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:345: task.task = base::Closure(); :-) Perfect. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: DidRunWorkerTask(this_worker, task); // Must be done inside the lock. Perhaps I'm a little confused here... was the task we just ran deleted while the lock was released (just after it was run)? If so, all we need to preserve from line 332 is the two bits of state that can be fed in here to DidRunWorkerTask() (re: was this a BLOCKING task, so we need to fix the count, and what was the sequence number that we have to pull from the set<>). http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); If you add one more scope to hold the AutoLock() on line 326, you'd have a place to put the signal() outside the lock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:428: delete_these_outside_lock->push_back(i->task); // See above. I *think* you take care of deleting this yourself separately. In addition, if you want to go this route, you should do the clear of delete_these_outside_lock *after* you run the task around line 345... or am I confused? http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:473: cond_var_.Signal(); As mentioned above... it is simpler to always signal when you release the lock to do some work... so you won't need to signal here. You *could* be more precise about the signalling.... but it is much easier to see everything is perfect if you don't bother (and you avoid checking for "more work" in one thread, and checking for "some work to do" in the signalled thread). http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:483: DCHECK(blocking_shutdown_thread_count_ > 0); nit: DCHECK_GT http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:513: // refuse to create more threads or they would be leaked. IMO, when you do the preparation for starting a new thread, you should update (here and now) all the counts you care about. That way, you won't have to wait for the thread to really start and do the additions to the state. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:525: thread_being_created_ = true; This is a nice flag.... as it can keep us from starting two threads when one will suffice. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; This is where you should do all the state changes to account for the new addition of a thread. You should effectively account for the thread as though it is asleep (waiting).
It will soon start up, and will go through all the checking that a thread would do when it wakes up. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.h (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:52: // deleted rather than run, and any tasks with this mode running at Your comments came later about how deletion is not guaranteed... but you should update this to at least say "deleted (if time permits) rather than run.." http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:70: // will be deleted rather than executed. However, tasks already in progress again: "...deleted (if time permits) and will not be executed." http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:101: SequenceToken(int id) : id_(id) {} nit: add explicit http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:109: virtual void WillWaitForShutdown() = 0; nit: add virtual destructor. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:112: SequencedWorkerPool(size_t max_threads); nit: explicit http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:154: bool PostSequencedWorkerTask(SequenceToken sequence_token, I'm tempted to believe that tasks should by default be required for shutdown. I think it takes a lot of thinking to be sure you're an "optional" task. As a result, I'd like to not bother this arg list with the shutdown behavior, and have a second method like: PostNonCrucialSequencedWorkerTask(...) WDYT? Do you really want every posted task to think through this and make a choice? I'm also wary that folks will "copy" the calling pattern, and not think through the ramifications. Today, all FileThread tasks are "important," as I *think* we join the thread on completion (yes??). As a result, the default migration should be to those semantics.
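To make the (a)-(f) pattern above concrete, here is a skeleton run loop in that shape. It only illustrates the locking discipline, not the patch's actual ThreadLoop/GetWork code, and the members named in the comment are assumed:

// Assumed members of Inner: base::Lock lock_; base::ConditionVariable
// cond_var_ (constructed with &lock_); std::list<base::Closure>
// pending_tasks_; bool terminating_.
void Inner::ThreadLoop() {
  while (true) {
    base::Closure task;
    bool should_exit = false;
    {
      base::AutoLock lock(lock_);             // (b) decide under the lock
      while (pending_tasks_.empty() && !terminating_)
        cond_var_.Wait();                     // nothing for anyone to do: wait
      if (pending_tasks_.empty()) {
        should_exit = true;                   // (b1) terminating_ and no work
      } else {
        task = pending_tasks_.front();        // remember the work in locals
        pending_tasks_.pop_front();
      }
    }                                         // (c) lock released here
    cond_var_.Signal();                       // (d) routinely let others check
    if (should_exit)
      return;                                 // (d1) exit after signaling
    task.Run();                               // (e) do the work outside the lock
  }                                           // (f) back to (a)
}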
Just saw this while skimming the code. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); linked_ptr isn't threadsafe... is this code safe? Is it possible another thread copies the linked_ptr from |threads_| at the same time |new_thread| is getting destroyed in this thread? I think that would lead to concurrent modification of a thread-unsafe data structure.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); I thought the access to this ref-counted data structure was only done under the protection of a lock. You are looking at the creation of the pointer with associated structure, prior to making it available. No one else has access to it until it is pushed into the list of threads (inside the lock). The one place this is later manipulated is when the vector of threads is deleted, by calling threads_.clear(). I believe that is done outside a lock, and might be a problem (as you suspected). The vector is only kept around to be able to do the join... and I've argued that the join is not needed (we should just look for the state to indicate that all critical threads are complete). There will still be a question of how to delete the |inner| instance.... but I think that is solvable. On 2011/12/09 21:41:47, willchan wrote: > linked_ptr isn't threadsafe... is this code safe? Is it possible another thread > copies the linked_ptr from |threads_| at the same time |new_thread| is getting > destroyed in this thread? I think that would lead to concurrent modification of > a thread-unsafe data structure.
I fixed everything except the one thing noted below. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); This is already done the way you are describing. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; It's not clear what you want me to move here. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); On 2011/12/09 22:13:58, jar wrote: > I thought the access to this ref-counted data structure was only done under the > protection of a lock. You are looking at the creation of the pointer with > associated structure, prior to making it available. No one else has access to > it until it is pushed into the list of threads (inside the lock). Will is correct, this is bad. > The vector is only kept around to be able to do the join... and I've argued that > the join is not needed (we should just look for the state to indicate that all > critical threads are complete). As I explained before, the requirement that the thread be joined is enforced by the thread code in base. I didn't choose to add this. I think the existing semantics of SimpleThread are outside the scope of this review.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); AutoLock() is on line 326, which is the outermost scope of the function. As a result, this Signal() is still done while holding the lock. This is not critical to functional correctness, but at times it can cause what is termed the "box-car performance problem." The problem can involve a thread waking up due to a Signal(), only to find that it can't acquire the lock, and then going back to sleep waiting on the lock. I think Clank manifests this problem, and often does a full-blown context switch (on a single core CPU!) when a signal() is provided. That is guaranteed to land on the other thread with the lock already in use. I *think* that was the most recent horror story I heard about this. On 2011/12/10 05:42:27, brettw wrote: > This is already done the way you are describing. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; On 2011/12/10 05:42:27, brettw wrote: > It's not clear what you want me to move here. IMO, the number of running (including to-be-created) threads should be a state variable, and that is what I think should be updated here. I apologize for having that as my mental model, which is not the current code, making my suggestion very confusing. Currently, you rely on threads_.size() to supply this count. That vector size will only be correctly updated after the new thread has started, and has updated the vector with a push(). If instead you used a state variable to keep track of the count of threads, then you can update it here (showing intent to start a new thread) and hence have it correctly set while you still hold the lock. That change would remove the concern about leaking expressed in line 512. If we don't want to leak, then we wait until the count of running threads goes to zero, and that count is not based on the vector size. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); IMO, the would-be-reusable code of SimpleThread is not as reusable as it could be, due to this requirement. The requirement should be removed (if it is really not critical to avoiding abuse), or the class should be modified to make it more reusable. We could, for instance, have a setter (or a constructor arg) that turns off this CHECK(). On 2011/12/10 05:42:27, brettw wrote: > On 2011/12/09 22:13:58, jar wrote: > > I thought the access to this ref-counted data structure was only done under > the > > protection of a lock. You are looking at the creation of the pointer with > > associated structure, prior to making it available. No one else has access > to > > it until it is pushed into the list of threads (inside the lock). > > Will is correct, this is bad. > > > The vector is only kept around to be able to do the join... and I've argued > that > > the join is not needed (we should just look for the state to indicate that all > > critical threads are complete). > > As I explained before, the requirement that the thread be joined is enforced by > the thread code in base. I didn't choose to add this. I think the existing > semantics of SimpleThread are outside the scope of this review.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); You are incorrect, the AutoLock is not at the outermost scope.
I originally had the created-threads flag be an integer count of the number of threads we're currently creating. So the integer you want would be threads_.size() + creating_threads_count_. But I think I explicitly don't want to have more than one thread in the "pending" state at once, which is why I check it on line 517. This would be different from just checking the count. Given this, I don't see how your method is any better.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); You are correct. I was incorrect. Sorry... I kept looking at the curly on the first line of the function <doh>. On 2011/12/10 17:25:28, brettw wrote: > You are incorrect, the AutoLock is not at the outermost scope.
Drive-by comments, looks pretty nice http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:16: #include "base/synchronization/waitable_event.h" is this include needed? http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:440: found_task = true; Would it make sense to put the logic under PrepareToStartAdditionalThreadIfHelpful() in this method? As coded, that method will rescan the list from the beginning, whereas this method could scan forward from where the current iter is? http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:453: int SequencedWorkerPool::Inner::WillRunWorkerTask(Worker* worker, The 'worker' arg isn't used here or in the DidRun method. http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:523: if (terminating_ && i->shutdown_behavior != BLOCK_SHUTDOWN) the outer condition says if !terminating_, is this nested if needed?
With nits mentioned in the drive-by... LGTM http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:523: if (terminating_ && i->shutdown_behavior != BLOCK_SHUTDOWN) On 2011/12/14 20:37:47, michaeln wrote: > the outer condition says if !terminating_, is this nested if needed? +1 It seems like these two lines should go.
I addressed Michael's comments except for the PrepareToStartAdditionalThreadIfHelpful one. It wasn't clear to me how this would work cleanly and the current scheme isn't optimized for long queues of tasks anyway. If we start having nontrivially long queues, we should refactor in more ways than this.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/brettw@chromium.org/8416019/96003
Try job failure for 8416019-96003 (retry) (retry) on win_rel for steps "base_unittests, safe_browsing_tests" (clobber build). It's a second try, previously, step "compile" failed. http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
The tests were flaky and I spent a really really long time figuring it out. There were several problems which are fixed in the latest patch (I'm not going to wait for another LGTM; if you have comments I'll be happy to fix them in a followup):
- (This is the only real bug.) There was a race between creating a thread and clearing the "I'm creating the thread" flag. This prevented us from creating threads when we could have, and sometimes deadlocked the test (which was waiting for more threads to be created).
- I also added a long comment explaining why we must check whether we need more threads before we actually run the task. My previous patches did this correctly, but it was accidental. I realized this is critical and somewhat subtle so I documented it better.
- There is an inherent race condition in thread creation with shutdown. I had to add an extra phase to the tests to ensure that all the workers are created before we start doing stuff. Otherwise, if shutdown was called, no more threads would be created. I think that this behavior is OK in real life since it's only a problem if you can post lots of tasks and then shut down faster than you can create threads. I think having this race is better than the alternative of blocking on thread creation in some crazy cases, or having more than one thread creation pending at a time (which causes other challenges). I documented this behavior well.
- My unit test was a disaster because I incorrectly assumed that if you signal an event X times, we would allow X threads to go through the event. I forgot that those X threads actually need to be waiting on the event! So I added a condition variable that we wait on to ensure that all the workers are actually waiting before they're woken up. This could be done slightly more cleanly, I think, by replacing the current event with another (or the same) condition variable. But after all the time I've spent getting these tests stable I'm loath to mess with it, and it's pretty short and not bad now anyway.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/brettw@chromium.org/8416019/102001
I just saw your comment about having trouble with the unit tests. My apologies. I did a very intense review of the working code, and was not very careful about your test code. Here are comments on the test code, which may explain why you are having trouble (unless I'm confused about how WaitableEvent() is supposed to work). http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool_unittest.cc (right): http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:47: // Note that this task has started and signal anwaybody waiting for that nit: anwaybody-->anybody http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:54: With Murphy threads, all the tasks *could* pause at line 54, and not reach line 55 to start the Wait(). As a result, a series of signals to the event could all be aggregated, as events don't (I think) sum up. This is the fundamental difference between events and condition variables. Cond vars can atomically release the lock, *and* start the wait, so it is impossible to pause between the two points. http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:153: event.Signal(); As noted above (see line 54), this won't consistently work, and will induce flaky bustage. The workers can, after notifying us of their task completion, block before the event wait(). As a result, a multitude of events.Signal() calls can be aggregated, and the desired number of awakenings may not appear. In addition, even if all the threads were actually blocked at event.Wait(), we might fire off all these event.Signal() calls before any awakening had a chance to reset the event... and hence again, signals may be lost (they aggregate into a bool). http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:188: event->Signal(); I don't think event contains a counter, and hence I don't think sending a multitude of Signal()s is guaranteed to awaken the desired number of Wait()ers.
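In case it helps, here is a self-contained sketch of the kind of test helper that avoids the lost-signal problem: rather than relying on one WaitableEvent::Signal() per worker, it counts how many workers are actually parked and hands out explicit unblock tokens under a lock. This mirrors the idea described above; it is not the unittest's actual code:

#include "base/synchronization/condition_variable.h"
#include "base/synchronization/lock.h"

class WorkerGate {
 public:
  WorkerGate() : cond_var_(&lock_), waiting_count_(0), unblock_count_(0) {}

  // Called by each worker task: registers as waiting, then blocks until the
  // test hands out an unblock token.  Wait() releases and re-acquires the
  // lock atomically, so a token can never slip past between the two steps.
  void Block() {
    base::AutoLock lock(lock_);
    ++waiting_count_;
    cond_var_.Broadcast();  // Wake the test thread watching waiting_count_.
    while (unblock_count_ == 0)
      cond_var_.Wait();
    --unblock_count_;
    --waiting_count_;
  }

  // Called by the test: waits until |count| workers are parked in Block(),
  // then lets exactly |count| of them continue.
  void UnblockWhenWaiting(int count) {
    base::AutoLock lock(lock_);
    while (waiting_count_ < count)
      cond_var_.Wait();
    unblock_count_ += count;
    cond_var_.Broadcast();
  }

 private:
  base::Lock lock_;
  base::ConditionVariable cond_var_;
  int waiting_count_;
  int unblock_count_;
};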
http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool_unittest.cc (right): http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:54: Ah, good catch. I'll fix this.
New snap up using cond vars instead of events. I think this is a bit cleaner and should remove the races.
lgtm
I found another real bug that causes test flakiness due to a race condition on thread creation. I kept a count of the number of running threads. This was incremented before I ran a task, and decremented after it was complete. The thread creation code used this to know if there were free threads. The problem is that if a thread was being created, it would not count as "running" until it actually picked up a task. This means that we'd think there was a free thread since the number of threads was greater than the number of running threads, and not create more. I inverted this condition to count "waiting" threads. Both counts are wrong sometimes, but now we err on the side of thinking the thread is busy rather than free, which allows us to create additional threads in this case.
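A rough sketch of the inverted check, with the caveat that the real helper also has to factor in shutdown state and sequence-token blocking; the parameter names here just mirror the discussion:

#include <stddef.h>

// Sketch only.  A thread that exists but has not yet reached its wait loop is
// simply "not waiting", so it counts as busy and no longer hides the need for
// another worker.
bool ShouldStartAdditionalThread(size_t thread_count,
                                 size_t max_threads,
                                 size_t waiting_thread_count,
                                 bool thread_being_created,
                                 size_t runnable_pending_tasks) {
  return !thread_being_created &&
         thread_count < max_threads &&
         runnable_pending_tasks > waiting_thread_count;
}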
See comments below. If you're not worried, and equally tired, then LGTM. If you encounter flaky tests, due to the race, then you may need to consider changes as mentioned. http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:367: waiting_thread_count_--; This is much nicer, as it is easy to read in one place and see correctness. http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:538: threads_.size() < max_threads_ && Note that you still depend on threads_.size() here... so racy pushing (changing size during the brief holding and release of a lock) can still lurk. You could get rid of the vector if you didn't use join, and instead used the condition variable and the lock. Then you'd have variables that were easier to change (increment and decrement) under the lock. I'm as tired as you are of this CL.... so if the race is not critical, then don't worry about this.
You're right! I moved this insertion into the list to the beginning of the actual worker thread. This way, the "inserting into the list" and "clearing the creation flag" happen at the same time, which I think will solve the problem.
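Roughly, the fix has this shape (the actual diff line is quoted in the next message; members here are simplified):

// Sketch: the first thing the new worker does inside the pool's thread loop
// is register itself and clear the creation flag in one critical section, so
// the two pieces of state can never be observed out of sync.
void Inner::ThreadLoop(Worker* this_worker) {
  {
    base::AutoLock lock(lock_);
    thread_being_created_ = false;
    threads_.push_back(linked_ptr<Worker>(this_worker));
  }
  // ...the normal wait / get-work / run-task loop continues from here...
}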
lgtm http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... base/threading/sequenced_worker_pool.cc:336: threads_.push_back(linked_ptr<Worker>(this_worker)); This is nicer and cleaner too. I like that it is right next to the clearing of thread_being_created_. http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... base/threading/sequenced_worker_pool.cc:540: threads_.size() < max_threads_ && Thinking about it more, I remembered that you're protected from this race by the boolean thread_being_created_. While that is set, we never test threads_.size(), which probably avoids the race.