Created: 9 years, 1 month ago by brettw
Modified: 8 years, 11 months ago
Reviewers: jar (doing other things)
CC: chromium-reviews, joi+watch-content_chromium.org, darin-cc_chromium.org, jam, michaeln, dpranke-watch+content_chromium.org
Base URL: svn://svn.chromium.org/chrome/trunk/src/
Visibility: Public.
Description
Add a sequenced worker pool.
This allows tasks to be put in the worker pool with optional sequencing semantics for consumers that must run a bunch of stuff in order on a background thread, but don't particularly care about which thread.
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=116078
Patch Set 1
Patch Set 2
Patch Set 3 (Total comments: 26)
Patch Set 4 (Total comments: 18)
Patch Set 5 (Total comments: 16)
Patch Set 6
Patch Set 7
Patch Set 8 (Total comments: 19)
Patch Set 9
Patch Set 10 (Total comments: 31)
Patch Set 11 (Total comments: 5)
Patch Set 12
Patch Set 13
Patch Set 14
Patch Set 15 (Total comments: 5)
Patch Set 16
Patch Set 17 (Total comments: 2)
Patch Set 18 (Total comments: 2)
Messages
Total messages: 58 (0 generated)
IMO, this whole activity cries out for using condition variables. They take care of so much of the lock work, it is amazing. Secondly, the way you're using locks here is philosophically very bad. It looks to me like you are locking code, rather than data. It is very hard to get code right when locks are used this way. Simply put, it is bad juju. ;-) That leads to lots of problems, not the least of which is that you do a lot of work under a lock, and that means that you'll tend to instigate contention. It is very easy to get deadlocks, and very hard to see them in such code :-(. With condition variables, the worker threads do the checking of the queue and pull things off to run, instead of having a scheduling thread do the work. Condition variables also lead to the better pattern of a) grabbing the lock, b) adjusting data, c) releasing the lock. If you want to see some examples of this stuff, condition_variable_unittest.cc has a bunch (to test the implementation). More specific comments below.

http://codereview.chromium.org/8416019/diff/3002/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:30: : thread_(StringPrintf("Browser worker %d", thread_number).c_str()),
Thread names are generally bouncy caps, with no spaces (sometimes hyphens). This is noticeable in about:tracking (the profiler). It is also interesting that line 30's naming convention produces an unbounded list of names (I noticed this as I was creating names for worker threads in the profiler). IMO, it is better to re-use names as they are recycled (some threads stop... then a bunch of new threads are started... and the names are recycled). Maybe this is just for debugging... and is a subtle detail for now. You might also just call them all "TokenizedWorkerThread" and not write the number into their name.

content/common/sequenced_worker_pool.cc:41: // This should only be called from within Inner's lock.
I couldn't grok the sentence. In general, locks should protect data, not code. It is better to define the association that way, and if you expect a lock to be held prior to a call (which manipulates said locked data), I find it helpful to include the word "Locked" somewhere in the method name.

content/common/sequenced_worker_pool.cc:123: size_t max_threads_;
I doubt that you can have a fixed number, unless it is very large.

content/common/sequenced_worker_pool.cc:155: // Created lazily when terminated_ is set and there are pending tasks, this
Why is this lazily created?

content/common/sequenced_worker_pool.cc:156: // is signaled by ScheduleWork whel all blocking tasks have completed.
nit: spelling: whel --> when

content/common/sequenced_worker_pool.cc:200: // for shutdown whenever the termianted_ flag is set.
Hmm... this suggests the semantics are to sacrifice order so that you get SHUTDOWN_BLOCK tasks to run.

content/common/sequenced_worker_pool.cc:211: i = pending_tasks_.erase(i);
IF the erase() can do anything, this is a very dangerous call. You're holding a lock, and calling an arbitrary method. At best I think you should pull out all the tasks you *plan* on erasing, release the lock, and then delete them.

content/common/sequenced_worker_pool.cc:236: UMA_HISTOGRAM_TIMES("SequencedWorkerPool.ShutdownDelayTime",
Histograms will often be useless at shutdown time. The problem is that we have to gather the data up and persist it, and *that* may be what we're waiting for as the last task :-/.

content/common/sequenced_worker_pool.cc:243: testing_observer_ = observer;
nit: A DCHECK that this was not already holding a pointer would be good.

content/common/sequenced_worker_pool.cc:264: return; // No work to schedule or no threads to schedule them on.
I don't understand why you returned here. Is this the case where we've scheduled all the important tasks... and we notice that they are using up all our threads? Why do we want to eject at that point?

content/common/sequenced_worker_pool.cc:270: // This algorithm is simple and fair, but inefficient in some cases. For
I see what you're going for now. You want to prioritize the task that has been waiting the longest, and give it the first open thread. OK... interesting.

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:24: // FROM_HERE, base::Bind(...));
Although I have yet to figure out how to do it, I *wish* we didn't have to see the "FROM_HERE" macro. As a result, I put it first in the argument list, so that: a) it blends with the stuff you don't have to read; b) if we ever figure out a way to make this better, we can search and replace the Post...(FROM_HERE with something better. Bottom line: I'd humbly request FROM_HERE be listed as the first arg. This will put less of this garbage in the middle of the "real" arguments which the author has to get right.

content/common/sequenced_worker_pool.h:34: // BLOCK_SHUTDOWN behavior, you must call Shutdown() which will wait until
I think we want most folks to not block shutdown. As a result, we should make it mean and obvious when they are blocking (or going to block) shutdown. For example, perhaps we need a long name like: SequencedWorkerPool::REALLY_REALLY_REALLY_HANG_THE_BROWSER_UNTIL_DONE. I want a code reviewer reading this to be VERY aware that the code is insisting on this behavior.

content/common/sequenced_worker_pool.h:66: SKIP_ON_SHUTDOWN,
This is a confusing distinction with CONTINUE_ON_SHUTDOWN. When you say that "tasks already in progress" will be completed, do you mean you will join the thread? Wait for completion before teardown? I guess you're going after something subtle: that some tasks, once started, become shutdown blockers. If they don't start, then they are not blockers, eh?

content/common/sequenced_worker_pool.h:70: // sparingly.
What happens if a BLOCK_SHUTDOWN task appears in line (with a given token) after less aggressive (SKIP_ON_SHUTDOWN, etc.) tasks with the same token? Do you sacrifice the ordering rule (discard the less important tasks, and run the shutdown blocker first), or do you sacrifice the meaning of CONTINUE_ON_SHUTDOWN and SKIP_ON_SHUTDOWN, and make the browser wait for all of them? IMO, this suggests that a given token should be tagged with a completion enum, rather than an individual task. What am I missing?

content/common/sequenced_worker_pool.h:77: // fail (return false) and the task will be deleted.
Hmm... that sounds like a confusing or racy definition. If you get into a queue, then you get to run... This might work... but I'm a little skeptical about racing.

content/common/sequenced_worker_pool.h:93: int id_;
This probably needs to be an atomic, or more likely, you need a lock to protect access when incrementing.

content/common/sequenced_worker_pool.h:106: // PostSequencedWorkerTask(). Valid tokens are alwys nonzero.
nit: alwys

content/common/sequenced_worker_pool.h:127: // shutdown regardless of the specified ShutdownBehavior.
I guess it is good that you have weasel words. Can we do better though?
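To make the line-211 suggestion concrete, here is a minimal standalone sketch of "collect the tasks you plan on erasing under the lock, then let them be destroyed after the lock is released." It uses std:: primitives and made-up names (InnerSketch, PendingTask) purely for illustration; the actual patch uses base::Lock and base::Closure, not these types.

```cpp
#include <list>
#include <mutex>

struct PendingTask {
  // In the real code this holds a closure plus sequencing info; here it is
  // just a placeholder whose destructor might do arbitrary, slow cleanup.
  ~PendingTask() {}
};

class InnerSketch {
 public:
  void DiscardSkippableTasks() {
    std::list<PendingTask> doomed;  // Destroyed at function exit, *after* the lock is gone.
    {
      std::lock_guard<std::mutex> hold(lock_);
      // Only data manipulation happens under the lock: move the tasks we plan
      // to drop into the local container. (The real code would do this
      // conditionally, based on each task's shutdown behavior; we drop
      // everything here for brevity.) splice() runs no destructors.
      doomed.splice(doomed.end(), pending_tasks_);
    }
    // Lock released; the tasks (and whatever their destructors do) die here.
  }

 private:
  std::mutex lock_;  // Protects pending_tasks_ (the lock guards data, not code).
  std::list<PendingTask> pending_tasks_;
};
```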
No new patch, but I tried to answer your bigger questions.

http://codereview.chromium.org/8416019/diff/3002/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:155: // Created lazily when terminated_ is set and there are pending tasks, this
On 2011/10/29 02:37:14, jar wrote:
> Why is this lazily created?
We only need it when there's a blocking task running when shutdown is called. No need to slow down startup creating a lock that probably won't be used, and if so, only at shutdown.

content/common/sequenced_worker_pool.cc:200: // for shutdown whenever the termianted_ flag is set.
On 2011/10/29 02:37:14, jar wrote:
> Hmm... this suggests the semantics are to sacrifice order so that you get
> SHUTDOWN_BLOCK tasks to run.
If there are no free threads, then we do block on any tasks to wait for a free thread.

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:66: SKIP_ON_SHUTDOWN,
Yes. SKIP_ON_SHUTDOWN is the default for the current file thread (I think we empty the task queue and join with the thread, but don't quote me on that). This is what I think of as the default. I'm worried about the "we'll forget about you on shutdown" behavior because I think it will be too easy to write crashing code. Obviously BLOCK has other disadvantages.

content/common/sequenced_worker_pool.h:70: // sparingly.
Interesting point, I missed that. I'll move the shutdown mode to the token for the sequenced case.

content/common/sequenced_worker_pool.h:77: // fail (return false) and the task will be deleted.
I didn't know how else to define this. I wanted a way to write files out, like the visited link file, where you post tasks to the FILE thread to write little bits to it as stuff happens (in this example, you probably won't get updates during shutdown, but anyway...). I want these updates to get written if the user quits, since links will render incorrectly for subsequent runs if they get dropped. If I do a task during shutdown, what happens? Today I do BrowserThread::PostTask(BrowserThread::FILE, ...) which then drops the task if we already joined with the file thread. From reading the comments on those functions, it looks like they even have a third mode where Quit is already posted to the thread but the thread still exists, in which case the task is dropped but PostTask still returns true. So with this model, at least you know what happens (the return value tells you whether it's guaranteed to run or not). The only other option is to just run the task right there before returning, which seems like a bad idea to me.

content/common/sequenced_worker_pool.h:93: int id_;
This isn't changed; it's the value returned to the caller that never changes (I should probably make this const). The thing that's incremented is in the Inner class.

content/common/sequenced_worker_pool.h:127: // shutdown regardless of the specified ShutdownBehavior.
See discussion above, I don't think we can do better.
I'm thinking of this as "replace the FILE thread with something more flexible" rather than "write the ultimate thread pool." As such, it has a fixed max number of threads which is passed in to the constructor. It creates them as needed. If there are no threads available, tasks wait in a queue for one to become available.

The max threads is passed to the constructor. I was going to hook this up to BrowserThread or something which would create it with an appropriate value. I was expecting the max number of threads to be relatively low (like 5-10) since in general the file thread seems mostly OK with just one thread. As a result, I didn't see much need for thread cleanup. I think this should answer a lot of your questions (like unbounded thread numbers, blocking in the waiting queue, etc.).

A more normal thread pool behavior would be to create threads up to some ridiculously high number for as long as there's stuff to run on them (Windows' default max is like 500) and then aggressively delete them. This didn't seem necessary to me for the task at hand, has slightly more complexity, and I worry about system starvation.

Regarding locks:
To add some color to what I was talking about in the previous message... For this case, we shouldn't really have that many tasks to perform in parallel. And if we do, my intuition says I don't want to have 20 threads just because we have 20 tasks. I also start to worry about overall system CPU or I/O consumption if we're doing that much in parallel. This is a bit different than DNS lookups for the network stack which uses a regular thread pool, where you kind of want more threads for each one because you know you're just going to be waiting. Maybe my intuition on this is wrong, however.
Those comments provide a lot more motivation. To summarize: you're not looking at replacing the current worker pool; you're only looking at creating something to help with the FILE thread pile-up. I'm pretty convinced that for such a case, a mere 2-4 servers (threads) are all you'll need to nicely handle the logjams. In any case, the task profiler I have will show what queueing delays are like for such a setup. With the current single-threaded FILE thread, task durations of 500-1500ms are common on my machine. The resulting queueing can also be seen to delay numerous tasks for most of that time (at startup). Certainly anything in the above direction (multiple servicing threads) is an improvement.

I had previously envisioned that several distinct threads would be combined into such a pool, but I see that is not your plan. My expectation was that a bunch of rarely used threads could all share such a pool, and each "user" would have one token. That picture has long-lived tokens. In contrast, the picture I now think you're going for has short-lived tokens. Perchance they are so short lived that when a single thread posts 3 tasks in a row, it might request/acquire a new token for that trio of tasks. Is that the plan?

The big reason for stopping and starting threads is (I think) memory costs. A thread carries a stack with it, and that stack may be large. Long-lived threads create a stack as deep as their largest task requires... and keep it forever. Short-lived threads end up using a much smaller (on average) allocated stack depth (assuming it is committed on demand). In fact, the common limit on threads in Windows has to do with 32-bit virtual address space vs. the cost of a stack per thread.

I'll have to go back and re-read your code with this perspective (replacing the FILE thread only). I do still believe (even with a tiny number of servers) that the code simplicity and correctness would be much better building atop condition variables.

Jim
I started writing a version with condition vars and it didn't seem like it would be much better.

How it works now: I create a base::Thread and PostTask the tasks to the MessageLoop of that thread. This is a bit more complicated than "normal" because I only PostTask when I know that thread isn't doing anything. There is a bunch of bookkeeping info that needs to be updated from within the lock. I didn't really even think of condition vars since normally we do everything with PostTask and locks if necessary. I think there is something to be said for the familiarity that these constructs offer to average Chrome engineers, as long as it isn't at the expense of complexity. It's possible I'm misunderstanding the condition var example and how it would simplify the code.

Here's what I was thinking: for the workers, instead of Thread I would use base::SimpleThread, which would do the condition var while() loop. If we wanted to be able to spin down threads like you were talking about at one point, I'd need to add more code in there to handle that case, although I think for the initial implementation we can just "leak" the threads. But I think I'd still need the lock for bookkeeping, since a lot of tracking is required for all the different tasks. The cond var would likely allow me to eliminate the idle threads queue, but I think most everything else would stay. It seems about the same complexity to me. Am I missing something that cond vars allow me to make simpler?
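For reference, a minimal standalone sketch of the "condition var while() loop" pattern being debated, written with std::condition_variable rather than Chromium's base::ConditionVariable, and with an invented class name (PoolSketch). This is the generic pattern only, not the sequencing/token logic of the patch: each worker waits on the condition variable, pulls one task off the queue while holding the lock, and runs it with the lock released.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical pool illustrating the condition-variable worker loop.
class PoolSketch {
 public:
  explicit PoolSketch(size_t num_threads) {
    for (size_t i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }

  ~PoolSketch() {
    {
      std::lock_guard<std::mutex> hold(lock_);
      shutting_down_ = true;
    }
    cond_var_.notify_all();
    for (auto& t : workers_) t.join();
  }

  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> hold(lock_);
      tasks_.push_back(std::move(task));
    }
    cond_var_.notify_one();  // Wake one idle worker to re-check the queue.
  }

 private:
  void WorkerLoop() {
    std::unique_lock<std::mutex> hold(lock_);
    for (;;) {
      // The condition is checked under the lock; wait() releases it while asleep.
      while (tasks_.empty() && !shutting_down_)
        cond_var_.wait(hold);
      if (tasks_.empty() && shutting_down_)
        return;  // Drained; exit (unique_lock unlocks on destruction).
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      hold.unlock();  // Never run arbitrary task code while holding the lock.
      task();
      hold.lock();
    }
  }

  std::mutex lock_;  // Protects tasks_ and shutting_down_ (data, not code).
  std::condition_variable cond_var_;
  std::deque<std::function<void()>> tasks_;
  bool shutting_down_ = false;
  std::vector<std::thread> workers_;
};
```

The point of contention in the thread is visible in the loop: the central lock holds only a queue and a flag, and the workers make all scheduling decisions themselves rather than having a controlling thread post to per-worker message loops.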
Drive-by: it'd be nice to have this class defined in the 'base' library; I'd like to be able to use this from within various src/webkit/... libraries outside of 'content'.
A high-level API fear is that we really need an enumerated list of pseudo-threads, rather than a dynamic token generator. IMO, too many sequences of tasks are sent within Chrome at various levels to a given thread, with the presumption of "sequencing." I don't think it is a "local thing" relating only to a function sending several tasks. I think the tasks are hierarchical, wherein, for example, multiple parent tasks are posted, and each parent task sends numerous child tasks. We expect the sequence of child tasks to be a very special traversal order, and I don't think we can emulate that easily with local sequence numbers. In this regard, I think your design is too general.

I'm still left feeling that we're pounding in a nail (task) with a jack-hammer (message loop), but I'm not clear how much perf difference will be involved (compared with condition variables), and how much we care (given we might have larger tasks on this file-like thread). I pointed out some of the visible issues in comments. Since I wasn't clear enough about the issue of holding locks in functions vs. locking data, I gave a number of explicit examples of where the technique is used in this CL. I also gave some examples of how the code could be made cleaner (relative to my targeted locking philosophy).

http://codereview.chromium.org/8416019/diff/7001/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:41: // This should only be called from within Inner's lock.
It is surprising to require a lock (when posting a task).

content/common/sequenced_worker_pool.cc:46: thread_.message_loop()->PostTask(sequenced_info.location, task);
It is more surprising to see a local lock held as we post to a message loop, which acquires yet another lock. I expect this is not necessary... but need to think more about it. Preliminary thoughts are that we're using a powerful tool, a message loop, to do some very minimal work: run one task at a time with no pending queue. It is kindred to using the super fancy (and heavyweight) thread-safe observer when all that is needed is a permanently registered callback function. The net-net is that rather than the cool efficiency of a message loop, which asymptotically needs just one lock acquisition (remarkably), and no signaling, to service a task, we now seem to require 3 lock acquisitions, and 1 signal, for every task. We no longer gain any efficiency under load. The uber lock that is held has a lot of potential for contention, as it is held for a long time, and held while acquiring other locks (which can then block it). I tried to think about the condition variable implementation. It would require (under load) asymptotically two lock acquisitions, and no signals. You could argue that the performance of this class is not critical, as we often run slow tasks in it (which will massively dominate any wimpy locking and signalling issue). On the other hand, I see that many tasks on the file thread are indeed fast... and I'd expect that more tasks would be tossed over here if we had multiple thread servers.

content/common/sequenced_worker_pool.cc:76: class SequencedWorkerPool::Inner
Suggest you add a comment to clarify that this is a singleton (if indeed it is).

content/common/sequenced_worker_pool.cc:133: std::deque<Worker*> idle_threads_;
You should use a stack, as I don't think there is a good reason to kill these threads (and certainly you're not yet doing such to justify a deque).

content/common/sequenced_worker_pool.cc:168: SequencedWorkerPool::Inner::GetSequenceToken() {
Cool as this is... I'm suspicious that we need an API that uses an enumerated list of tokens, rather than ad-hoc ones. I *think* that it will be less common to be able to post two or three tasks with a sequence token, and have that suffice. I expect that the task that is running (that posts additional tasks) will commonly be part of some outer sequenced stream of tasks, and all children will together need to be sequenced. This is accomplished today with our enumerated list of distinct threads. I think we should stay with that enumeration model, and just have the freedom to assign a sequence (enum) to any worker thread at any time (even though this will wreak a bit of havoc with the profiler... but that should be my concern... and not an issue for your design).

content/common/sequenced_worker_pool.cc:196: void SequencedWorkerPool::Inner::Shutdown() {
This function is written as a blocking call. I much prefer asynchronous calls, especially since we're living in a message loop world (where we shouldn't block a loop). What thread are you anticipating will run this call (and block)? I'm a little wary of a deadlock here as well as performance issues.

content/common/sequenced_worker_pool.cc:211: i = pending_tasks_.erase(i);
Does this potentially induce a task destructor? Do we have assurances on how long such things take? I've often wondered if a task should not only have a Run() method, but also a DiscardedWithoutRun() method that should presumably do MUCH less work. My philosophy for locks suggests that we should move all the deletes outside the lock. You can easily do this here. You should put all the tasks you're thinking of erasing into a separate container, and then delete them after you release the lock (you might have a self-deleting container at function scope, outside the locked scope, for this purpose, and then you can just do the return statements on line 219).

content/common/sequenced_worker_pool.cc:246: void SequencedWorkerPool::Inner::ScheduleWork() {
You might consider adding a comment that this function will schedule, at most, one new task, and possibly fewer if there is nothing to do. I'd suggest returning a bool to indicate if you succeeded in scheduling work.

content/common/sequenced_worker_pool.cc:279: // next one from that token right away.
That is a good optimization for efficiency... but it destroys your fairness promise. For example, if there are N worker threads, then once we have N distinct sequence tokens running, they can hog all thread usage until they run out of tasks :-(. Methinks that may be what you're saying in the next paragraph.

content/common/sequenced_worker_pool.cc:296: Worker* worker;
Should this be a linked_ptr?

content/common/sequenced_worker_pool.cc:313: worker->PostTask(*i, base::Bind(&Inner::RunTask, this, worker, *i));
Now that we have the worker thread out of the idle_threads_ stack, there is no rush to post this task (while holding the lock). The current structure makes this a little difficult (since the whole function scope is locked on entry), but we could (at line 316) AutoUnlock the lock briefly to be cleaner about the construction of the Bind() and the posting of the task (some refactoring would probably be yet cleaner... such as returning a runnable task... but there are probably better/cleaner ways as well).

File content/common/sequenced_worker_pool.h (right):

content/common/sequenced_worker_pool.h:59: // it's best to use this only for slow but simple operations like the DNS
DNS might not be a good example. DNS probably needs to run on full-blown worker threads. They have the potential for really hanging around... and there are often a lot of them (at least 8, and more with retry logic) in parallel.

content/common/sequenced_worker_pool.h:65: // will be completed.
What is the alternative to "will be completed?" There is no nice way to kill a thread that is uncooperative. When you say "will be completed," are you noting that the thread this is run on will be joined, and hence process shutdown will wait for it? How else would you assert "will be completed?" I'm confused. I suspect you can only talk about run semantics up to the point you run or discard. Once you start to run, you can't abort it, but you can delay process termination to wait for it.

content/common/sequenced_worker_pool.h:75: // If a task is posted during shutdown, it will not get run since the
Just to be clear... I think your rule is: "If this is successfully posted, this task will be run to completion."

content/common/sequenced_worker_pool.h:131: const base::Closure& task);
I'm still feeling the arg order is wrong. I think it should be from_here, then Closure, then shutdown behavior, then sequence token. That would make it look more like PostTask and PostDelayedTask, where the delay element is last (optional), and from_here is always first.

File content/common/sequenced_worker_pool_unittest.cc (right):

content/common/sequenced_worker_pool_unittest.cc:16: // tests should not be flaky, and hangling indicates a type of failure. Do not
nit: typo hangling -> hanging
On 2011/11/23 20:08:16, jar wrote:
> A high-level API fear is that we really need an enumerated list of
> pseudo-threads, rather than a dynamic token generator. IMO, too many sequences
> of tasks are sent within Chrome at various levels to a given thread, with the
> presumption of "sequencing." I don't think it is a "local thing" relating only
> to a function sending several tasks. I think the tasks are hierarchical,
> wherein for example, multiple parent tasks are posted, and each parent task
> sends numerous child tasks. We expect the sequence of child tasks to be a
> very special traversal order, and I don't think we can emulate that easily with
> local sequence numbers. In this regard, I think your design is too general.

Can you clarify what I should do in response to this comment?
http://codereview.chromium.org/8416019/diff/7001/content/common/sequenced_wor...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:168: SequencedWorkerPool::Inner::GetSequenceToken() {
I think an enum destroys the whole point of this patch. In a previous comment michaeln requested this be in base since he wants to use it in src/webkit. I'm not going to move it right away, but how could this be general enough to maintain a global list of enums for the browser while living inside of base? What would the enum look like if we had it? What possible meaning would we assign to different threads that people could understand?

Is there an example of a multilevel sequencing problem using the FILE thread that you're worried about? The vast majority of the tasks posted to the file thread are independent. There are a handful of cases, like the visitedlink master, that post updates to the same file, and that's what this class is attempting to solve.

content/common/sequenced_worker_pool.cc:196: void SequencedWorkerPool::Inner::Shutdown() {
This has to be the main thread. It has to block since some tasks have the semantics of "block shutdown until I complete".
I think the enum list is the natural way to go. Your concern in a follow-up email is how to put this in base and still have an arbitrary list? To handle that, what you'd need is a more dynamic (run-time) enumeration. I'd suggest that a user can call using a string, and that you hash the string into an int which is a sequence number. The interface should be set up to make that work easily (i.e., probably the call to post a task takes a string name). The hash could be deterministic (MD5), or simple and collision free (we assign atomic names to each string used). Such an interface would simulate an arbitrary number of threads (and we can be adaptive about how many we make!), but would have the ability to dynamically assign things to a single thread.

I do like the idea of being able to say "just run it" with no serialization (you used id == 0 in your CL). I can believe that can be deduced locally (knowing only the task, and having no care about the cause). If you went with strings, perhaps "" would be the "just run it with no serialization" mantra.

Jim

On Wed, Nov 23, 2011 at 12:43 PM, <brettw@chromium.org> wrote:
> Can you clarify what I should do in response to this comment?
On Wed, Nov 23, 2011 at 1:01 PM, <brettw@chromium.org> wrote:
> Is there an example of a multilevel sequencing problem using the FILE
> thread that you're worried about?

The problem is that I'm not sure how to migrate code that used to post to the file thread. I'd much rather (when evolving sync code, for example) post to the "SyncFile" thread where I used to post to the "FileThread." I think I see a faster path to migrate code. The granularity is much coarser than your proposal with GUIDs being generated on demand, but I don't see conflicts in the profiler that call for such extreme measures. The logic under this enum would remain the same... I just don't see

Jim
On Wed, Nov 23, 2011 at 2:13 PM, Jim Roskind <jar@chromium.org> wrote:
> On Wed, Nov 23, 2011 at 1:01 PM, <brettw@chromium.org> wrote:
> > Is there an example of a multilevel sequencing problem using the FILE
> > thread that you're worried about?
>
> The problem is that I'm not sure how to migrate code that used to post to
> the file thread. I'd much rather (when evolving sync code, for example)
> post to the "SyncFile" thread where I used to post to the "FileThread." I
> think I see a faster path to migrate code. The granularity is much coarser
> than your proposal with GUIDs being generated on demand, but I don't see
> conflicts in the profiler that call for such extreme measures.

Okay, your suggestion makes more sense to me with this example. I was envisioning that in these cases some kind of "main class" would provide the sequence token to interested parties to use. We do this for property bags (e.g. HtmlDialogUI::GetPropertyAccessor()), which seems to work well, but could be awkward in some cases for a very decentralized component with no natural place to put this accessor.

I'll probably just make a map from string name to ID so I can keep using ints internally and the conversion is super simple to understand.

Brett
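A sketch of the name-to-ID mapping Brett describes, with illustrative names (TokenRegistrySketch, GetNamedSequenceToken) and std:: types rather than whatever the patch eventually uses: repeated lookups of the same string return the same integer token, so decentralized callers can agree on a sequence simply by agreeing on a name.

```cpp
#include <map>
#include <mutex>
#include <string>

class TokenRegistrySketch {
 public:
  // Returns the same nonzero token every time the same name is passed in.
  int GetNamedSequenceToken(const std::string& name) {
    std::lock_guard<std::mutex> hold(lock_);
    auto found = named_tokens_.find(name);
    if (found != named_tokens_.end())
      return found->second;
    int token = ++last_token_;  // 0 stays reserved for "unsequenced, just run it".
    named_tokens_[name] = token;
    return token;
  }

 private:
  std::mutex lock_;  // Protects named_tokens_ and last_token_.
  std::map<std::string, int> named_tokens_;
  int last_token_ = 0;
};
```

So code that used to post to a dedicated named thread could instead post with, say, GetNamedSequenceToken("SyncFile"), and every caller using that string lands in the same sequence without a central place handing out tokens.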
Michael: I'm going to leave this in content for now. We can move it later. I'm not entirely sure how you would use it in WebKit, however. It's not a singleton; somebody needs to configure the number of threads it has and own it. I was going to have BrowserThread do this, so it wouldn't be available to the webkit directory anyway. You could have a separate one for WebKit, but that sounds wasteful. We'll have to figure out how this works. Conceivably, we could have one in base and have content / the browser configure it.
I uploaded a new patch using condition variables. The code is longer. Currently it deadlocks and I've yet to figure out why. I have to do extra work when posting a task to see if adding another thread can legitimately do work. I can get around this brute-force search of all tasks (line 240) by adding more tracking information, which will make it even more complicated. I don't think this is an improvement. My previous code never had any deadlock bugs and was easier for me to debug. I did not have to review how condition variables work, and people reading and maintaining the code wouldn't have to either (everybody understands PostTask and locks). Figuring out whether adding another thread would be useful was easier since tasks were explicitly scheduled rather than "consumed" (avoiding the race I explained in the comment). I think we should use the previous version.
Thanks for looking at the alternative. As the comments below show, you didn't fully drink the kool-aid for condition variables and threading. You have to trust the threads to do more work, and make autonomous decisions. You need very little state in the central locked singleton. You should generally have only a few counters, plus the set of active tokens. Deadlocks are very hard to induce with condition variables, as there is very little done under a lock. I suspect you got tangled up with the waiting event that you had from your previous implementation. You should (as noted below) wait for shutdown by waiting on the CV. I think that you're also going to reap rewards in shutdown performance, as (per comments below) you get to do cleanup of discarded events on worker threads (instead of on "a main controlling thread"). You'll also get the cleanup done while you're waiting for the last critical worker threads to complete.

http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:100: bool RunnableTasksAreOverSubscribed() const;
I don't think you'll need this function.

content/common/sequenced_worker_pool.cc:133: std::set<Worker*> running_threads_;
I'd expect each thread to hold its own status. At best, I'd expect counts to be maintained in this controlling singleton.

content/common/sequenced_worker_pool.cc:158: scoped_ptr<base::WaitableEvent> shutdown_complete_;
The condition variable would be the natural thing to wait for, even on a non-worker thread.

content/common/sequenced_worker_pool.cc:246: // of task and free thread counts would indicate that a new thread should be
I think what you left out of this argument is that the thread that was just started to handle the unused sequence token will be entrusted to create another thread if need be. So long as you know there are threads waiting for your signal, you should let them handle the work of starting a (single) new thread if need be.

content/common/sequenced_worker_pool.cc:250: // Therefore, we do a more thorough brute-force check of runnable tasks to
I think we should just maintain a few counters, and let them form the "conditions" that are checked, and there would be no new race, or complications. When someone posts a task, if we have more threads running than we have sequence tokens in the active set, then everything will sort itself out (no need to figure out if you should start yet another thread or not... the extra running thread will do that soon enough). The only time to start a new thread is when all threads are busy, and the new token is not in the active set. There is then a race of sorts, as an active thread can always race to finish against the startup of the new thread, but that is inevitable with any implementation. No matter who starts a new thread, it is possible that instantly we'll have all the other threads finish their work (they can even be waiting for the lock as you're deciding!), leaving us wondering why we started a new thread. The difference is that with condition variables you notice this race (the logic is simpler, and you can see it), whereas with the previous implementation you started a message loop and task, and ignored the startup time (which was actually much more extensive). Bottom line: a race can always cause us to create one more thread than we "could have gotten away with."

content/common/sequenced_worker_pool.cc:258: // it's not necessarily fast and it could block other work), but avoiding
All you do inside the lock is bump the counter on the number of started threads. You then exit the lock, and start the thread.

content/common/sequenced_worker_pool.cc:260: threads_.push_back(linked_ptr<Worker>(new Worker(this, threads_.size())));
I don't think you need to maintain a list of threads. I think they should be self-owning. As a result, I don't think you need this push inside the lock. I think you need a count of the number of threads that need to be waited for at shutdown. That is what any caller that wants to effectively join the threads should check for as a "condition" when they are waiting for those urgent threads to complete.

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
This cleanup action should happen on a worker thread. This will also keep such cleanup off the critical path of the thread that called shutdown, unless you want to assert that cleanup (deletion of discarded tasks) must be waited for. If so, then worker threads should each take it upon themselves to delete one discarded task at a time... and they'd all run in parallel.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
Instead of using a waitable event, you should wait on the condition variable, testing to see that cleanup and runs are complete.
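A sketch of the shutdown wait jar is suggesting in place of the WaitableEvent, with invented names and std:: primitives (the real code would use base::Lock/base::ConditionVariable): the caller of Shutdown() waits on the same condition variable the workers use, and the "condition" it tests is just a couple of counters.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative fragment: assume these members live in the pool's Inner class
// and are only touched under lock_.
struct ShutdownStateSketch {
  std::mutex lock_;
  std::condition_variable cond_var_;
  int pending_blocking_tasks_ = 0;   // Queued BLOCK_SHUTDOWN-style tasks.
  int running_blocking_tasks_ = 0;   // BLOCK_SHUTDOWN-style tasks currently executing.
  bool shutdown_called_ = false;

  void Shutdown() {
    std::unique_lock<std::mutex> hold(lock_);
    shutdown_called_ = true;
    // Wait on the condition (a few counters), not on a separate event object.
    while (pending_blocking_tasks_ > 0 || running_blocking_tasks_ > 0)
      cond_var_.wait(hold);
  }

  // Workers call this whenever they finish a blocking task; the notify can
  // (and should) happen after the lock is released.
  void OnBlockingTaskFinished() {
    {
      std::lock_guard<std::mutex> hold(lock_);
      --running_blocking_tasks_;
    }
    cond_var_.notify_all();  // Wakes both idle workers and a blocked Shutdown().
  }
};
```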
http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:336: task.task.Run();
One other thing... it is common to signal after you take on work, to have some other worker check to see if there is more work. You could spend the time in this thread seeing if you "need help" servicing everything, or you could just signal, and let some other thread check (in parallel) to see if there is more work to do. That missing line might be a part of your deadlock also.

Since you did go with a lazy start of threads, you'll also need a boolean to indicate if you should spawn another worker just before you start to do your "real" work (the call of Run()). The entity that decides to spawn a new thread should bump up the count on active threads (inside the lock), so there is no race or confusion about the count of active threads.
http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
How can you guarantee that there is a worker thread available to run the discarded event? They may all be running CONTINUE_ON_SHUTDOWN tasks and will never run more code from us again.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
You want me to wait on the same condition variable I use for the workers? That seems super confusing to me.

content/common/sequenced_worker_pool.cc:336: task.task.Run();
On 2011/11/26 01:25:59, jar wrote:
> One other thing... it is common to signal after you take on work, to have some
> other worker check to see if there is more work. You could spend the time in
> this thread seeing if you "need help" servicing everything, or you could just
> signal, and let some other thread check (in parallel) to see if there is more
> work to do.
>
> That missing line might be a part of your deadlock also.
You're right, that's what was causing my problem. I think this adds to my argument that condition variables are more difficult to understand than the normal Chrome constructs.

> Since you did go with a lazy start of threads, you'll also need a boolean to
> indicate if you should spawn another worker just before you start to do your
> "real" work (the call of Run()).
I'm going to check the queue every time I take something out. It's difficult to figure this out ahead of time since there may be multiple threads required.

> The entity that decides to spawn a new thread should bump up the count on active
> threads (inside the lock), so there is no race or confusion about the count of
> active threads.
I continue to think this is the perfect job for CVs. I think this sort of application is probably the most typical example of what CVs are used for (though you've added some interesting semantic twists for the serialization). I also see your current trip towards CVs as being valuable, and illuminating a number of less clear elements of the spec (state machines tend to be very forthcoming about covering all possible transitions). I have the sense that you're really opposed to these... and really think the message loop hammer is a better approach... but since you're writing the code, go with what you feel comfortable with, and I'll comment in that domain.

http://codereview.chromium.org/8416019/diff/19001/content/common/sequenced_wo...
File content/common/sequenced_worker_pool.cc (right):

content/common/sequenced_worker_pool.cc:292: dropped_tasks.push_back(*i);
Interesting question. I hadn't thought about that as a requirement. It is a great question. Do you want the semantics to guarantee destruction of discarded tasks? I'm surprised to see you suggest that you're willing to wait for them to get destroyed, but yet you're not waiting for them to complete a run :-/. IMO, given that you're willing to leave some non-critical tasks running, it is no big deal to leave some tasks undeleted if you get "too busy." I'd expect that more typically, the non-critical tasks would finish, and you'd soon be down to a few critical tasks, and a bunch of idle threads... just waiting to do your deletions.

Your question is very similar to the question of what to do if you have all your threads "stuck" running non-critical tasks, but you need to run critical tasks at shutdown. What do you think the right thing is to do? Do you want the semantics to be that the "must run" tasks will be run only on a restricted set of worker threads? Are you willing to sacrifice the thread count limit? ...or is that firm, and you'll wait indefinitely for a time-slot on a busy worker thread? ...or do you want to try to run the "must run" tasks on the thread that calls for a shutdown, since it was going to wait indefinitely anyway... hmm. My guess is that this is a great question... and we should settle on the semantics. WDYT? Hang?

At the heart of this, the current official release has a shutdown problem that revolves around a non-critical(?) task hanging, and blocking all availability of "the thread pool" (a single FILE thread) from running a critical task. I think that during shutdown, you have to make some extra threads available to ensure that you can service the critical tasks. I suspect that otherwise, we'll sometimes get a bunch of hung threads, and have no thread power to handle the shutdown. This "reserve tank" approach is commonly used in memory allocators to facilitate more controlled crashes when we're "out of memory." Maybe we need a reserve number of threads for such emergencies.

On 2011/11/26 01:54:40, brettw wrote:
> How can you guarantee that there is a worker thread available to run the
> discarded event? They may all be running CONTINUE_ON_SHUTDOWN tasks and will
> never run more code from us again.

content/common/sequenced_worker_pool.cc:306: shutdown_complete_.reset(new base::WaitableEvent(false, false));
On 2011/11/26 01:54:40, brettw wrote:
> You want me to wait on the same condition variable I use for the workers? That
> seems super confusing to me.
Clearly the goal is to wait until the system achieves some "state" (all required threads have run). It seems really natural to wait for such an event. CVs are really good about avoiding the race between checking state and waiting. CVs can be built atop waitable events, and vice versa.

content/common/sequenced_worker_pool.cc:336: task.task.Run();
On 2011/11/26 01:54:40, brettw wrote:
> I'm going to check the queue every time I take something out. It's difficult to
> figure this out ahead of time since there may be multiple threads required.
The trick is to just move incrementally in the right direction. You don't need to start a lot of threads. At most, one more... and that thread, when it starts, may choose to start another. IMO, the simple approach is to always have one extra on hand. Any time you run a task, if you think you are the last idle thread, and you're below the limit, you should start another. As I said, you should try to make it simple, and just look at the few state variables (the conditions) to see what you should do. If you want to be fancy, when you call the function that finds the next task that is eligible to run, that function can return a bool indicating if there are other tasks that could have been run as well. That would give you the minimal thread creation.
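A sketch of the incremental growth rule described just above ("if you think you are the last idle thread and you're below the limit, start another"). All names here are made up for illustration; the decision and the counter bump happen under the lock, while the slow thread creation happens outside it.

```cpp
#include <mutex>

// Illustrative sketch; names are invented, not from the patch.
struct GrowthSketch {
  std::mutex lock_;
  int idle_thread_count_ = 1;
  int thread_count_ = 1;
  int max_threads_ = 4;

  void StartOneWorkerThread() { /* create and start a worker (omitted) */ }

  // Called by a worker just before it dequeues a task and starts running it.
  void MaybeStartExtraWorker() {
    bool should_spawn = false;
    {
      std::lock_guard<std::mutex> hold(lock_);
      // If this worker is the last idle one and we are below the limit,
      // commit to exactly one extra thread by bumping the count under the lock.
      if (idle_thread_count_ == 1 && thread_count_ < max_threads_) {
        ++thread_count_;
        should_spawn = true;
      }
    }
    if (should_spawn)
      StartOneWorkerThread();  // Thread creation is slow; do it outside the lock.
  }
};
```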
I thought about it more, and think the deletion question for un-run tasks may be even more significant. I *think* that it may be problematic to delete pending tasks before corresponding (sequenced) tasks have completed, even if the already-running tasks were marked as non-critical. The problem is that the would-be-subsequent tasks might delete things that are being processed by a current task. The (closest) current example would be a task labeled as delete-soon. It is possible to send a task to do something, and then send a sequentially subsequent task that deletes an object. Depending on how that deletion task is crafted, the mere act of deleting the task might achieve the deletion of the object in question.

Bottom line: I think that IF you're going to delete tasks that have not been run, you should delete them in a sequenced fashion. I *think* we sequence the destructions today on our named threads, after we've stopped running tasks. If you buy that approach, then the "right" place to delete tasks is probably on the worker threads (though you could do it on the shutdown request thread). The easy(?) way to do the sequential deletion on a worker thread is to bind a task into a deletion closure (whose execution will drop the ref), and then run that closure (in lieu of the actual task). In any case, it was a great question to ask about the semantics of destruction of the un-run tasks.
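A small standalone sketch of the "bind the task into a deletion closure" idea (Task and MakeDiscardClosure are illustrative names, not the patch's types): instead of running the task, the worker runs a wrapper whose only effect is to drop the task, so its destruction still happens at its place in the sequence, on a worker thread.

```cpp
#include <functional>
#include <memory>

// Stand-in for a posted closure whose destructor may itself delete objects
// (e.g. a DeleteSoon-style task).
struct Task {
  std::function<void()> run;
};

// Returns a closure that, when executed in the task's normal place in the
// sequence, destroys the task without running it.
std::function<void()> MakeDiscardClosure(std::unique_ptr<Task> task) {
  // shared_ptr because std::function requires a copyable callable; running the
  // wrapper drops this closure's reference, destroying the un-run task then.
  std::shared_ptr<Task> doomed(std::move(task));
  return [doomed]() mutable {
    doomed.reset();  // Destroy the discarded task here, in sequence order.
  };
}
```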
Question: If you use a message loop based implementation, how would you correctly(?) handle having a sequenced-task (try to) post another task to the "current" message loop? Wouldn't that bypass your efforts at fairness? You could ask folks to "not do that," but migration of existing code would be complicated, and it would be hard to spot this routine action as illegal. Maybe you'd use a message loop which refuses to reveal itself as current? Points at the sequencing queue instead?
I agree that the cond vars are the correct CS way to solve this problem, and they will be faster than the message loops. Now that I have it working, I don't think we should go back. But it doesn't seem clearly better (some parts are nicer, some parts are more subtle, it's still more LOC) and I don't think the time investment was worthwhile to change it.

This patch is a nice improvement on the previous cond var patch set. I like the spawning of threads from the background threads in more cases, but this is one of the subtle cases where it's difficult to see how it works just from reading the code. I had to add more exit code handling, and removed some of the thread tracking you pointed out.

I did not change the shutdown event. The current shutdown semantics, which I'm happy with, are that we just run critical tasks as fast as possible on shutdown, blocking on already-running non-critical ones if necessary. I'm assuming tasks won't hang indefinitely and we'll eventually run them. The semantics are the same as for a MessageLoop thread except that some tasks won't get run. I don't foresee this coming up much and I don't think it's worth adding some elaborate code to handle it.
> Bottom line: I think that IF you're going to delete tasks that have not been > run, you should delete them in a sequenced fashion. I think I need to delete the tasks to keep the memory bots happy. Personally, I'd rather drop them on the floor. I agree that the tasks should be deleted in sequence. This is a pretty far out edge case, but if we ever hit it the crash will be impossible to debug. I'll add that.
I see the problem now. We can't delete tasks in sequence without extra threads. I'll have to think about that more...
New snap up. This does a best effort to delete tasks while keeping everything sequenced. I think continue on shutdown tasks will be rare and this probably won't come up enough to make the memory bot significantly flaky.
On 2011/11/25 18:33:38, brettw wrote:
> Michael: I'm going to leave this in content for now. We can move it later. I'm
> not entirely sure how you would use it in WebKit, however. It's not a singleton,
> somebody needs to configure the number of threads it has and own it. I was going
> to have BrowserThread do this, so it wouldn't be available to the webkit
> directory anyway.
Take a look at your .h and .cc files. There are no dependencies on content (content_export.h doesn't count), so why put this class in the content library?

> You could have a separate one for WebKit, but that sounds wasteful. We'll have
> to figure out how this works. Conceivably, we could have one in base and have
> content / the browser configure it.
There's no good reason to have a separate instance that I can see. Let's take the time to think about it before the initial commit. Here's one approach...
- the SequencedWorkerPool class is defined and implemented in the base library
- somewhere in chrome/content an instance is created and configured
- somewhere in test_shell/chromiumDrt an instance is created and configured
- references to that instance are passed to src/webkit object ctors
(maybe the SequencedWorkerPool s/b refcounted and have an explicit Shutdown method)

I think any approach would be predicated on the SequencedWorkerPool class being defined in the base library, so let's do that much from the get-go.
On 2011/11/28 21:30:51, michaeln wrote: > On 2011/11/25 18:33:38, brettw wrote: > > Michael: I'm going to leave this in content for now. We can move it later. I'm > > not entirely sure how you would use it in WebKit, however. It's not a > singleton, > > somebody needs to configure the number of threads it has and own it. I was > going > > to have BrowserThread do this, so it wouldn't be available to the webkit > > directory anyway. > > Take a look at your .h and .cc files. There are no dependencies on content > (content_export.h doesn't count), so why put this class in the content library? This isn't used by anybody yet. It's intended to be a replacement for the FILE thread, and the FILE thread lives in content. How do you do today what you want to do with this thread pool?
> How do you do today what you want to do with this thread pool? MessageLoop and MessageLoopProxy are defined and implemented in base. Libraries implemented in webkit utilize MessageLoopProxy to schedule tasks on appropriate threads that are created by chrome/content. References to the appropriate loop proxy are provided at ctor time.
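A rough sketch of the injection scheme being discussed, assuming the pool ends up in base, is refcounted, and gets an explicit Shutdown() as michaeln proposes. The class FileStore, the function DoBlockingWrite, and the exact posting signature are made-up illustrations, not code from this patch:

// Sketch only: FileStore/DoBlockingWrite are hypothetical, and the
// PostWorkerTask signature is assumed; the real header may differ.
#include "base/bind.h"
#include "base/location.h"
#include "base/memory/ref_counted.h"
#include "base/threading/sequenced_worker_pool.h"  // assuming the move to base

static void DoBlockingWrite() {
  // Blocking file IO would go here.
}

class FileStore {
 public:
  // The webkit-layer object receives the pool at construction time, the same
  // way MessageLoopProxy references are handed in today; it never creates or
  // owns the worker threads itself.
  explicit FileStore(const scoped_refptr<base::SequencedWorkerPool>& pool)
      : pool_(pool) {}

  void ScheduleWrite() {
    pool_->PostWorkerTask(FROM_HERE, base::Bind(&DoBlockingWrite));
  }

 private:
  scoped_refptr<base::SequencedWorkerPool> pool_;
};

// Embedder side (chrome/content, or test_shell for DRT):
//   scoped_refptr<base::SequencedWorkerPool> pool(
//       new base::SequencedWorkerPool(4));
//   FileStore store(pool);
//   ...
//   pool->Shutdown();  // explicit shutdown before teardown, per the proposal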
Again... if you're hating CVs... and havin' to listen to me yap... please feel free to revert to the heavy-hammer message loop. Comments include a combination of places where the code was longer than need be, and some general comments. You're very close to a clean CV implementation IMO. I'm betting it will be smaller than the message loop variant (if you got the message loop variant to handle as much stuff). I'm ok with the apparent plan to discard (not delete) tasks that are not run, but you should document that in the headers. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; You don't need to have a list of threads. All you really need is: active_threads_count_ The current threads_.push() becomes an increment threads_.size() is just active_threads_count_ ...and your join() is just a wait on the CV for the count to reach 0 (which will happen slightly before you'd hear about the thread fully finishing via a join()). You just have to decrement the counter when you decide to exit your main run loop (at line 302... and you should signal()). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:187: cond_var_.Broadcast(); This works... but assuming each thread will, if it finds any work to do, signal for others to check out the state (condition), you don't really need a broadcast(), and a mere signal() will do. In general, broadcast is (I think) rarely used. All the threads would wake up, only to find that they couldn't access the lock, and then they'd go to sleep waiting :-/. Signal() tends to nicely sequence the tasks as locks are released. Broadcast is safe... and maybe this is NBD, since perf is not that critical... but I think signal() is mostly a better pattern to stay with. Also note: Signal() and Broadcast() can both be done outside of a lock... and usually that is better, since the awakened threads have a chance at the lock being available. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:242: cond_var_.Signal(); nit: Nicer to signal after you release the lock. IMO, as a cleaner alternative, you should just always signal() when you release the lock in the main run loop. That means both just before run(), and when you exit the loop. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:260: // There are no pending or running tasks blocking shutdown, we're done. This is perfect. You've effectively checked the condition (under the lock), and decided that everything is done. That is indeed the clean way to use a CV. If you changed the "if" to a "while" and negated the condition, you could put a CV wait here and be done. ...and you would not have to mess with the waitable event.... which is what you effectively realized in the if statement above. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: i = pending_tasks_.erase(i); So you decided to leak the less important tasks? I thought you were planning on deleting them in sequence on these worker threads. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo...
content/common/sequenced_worker_pool.cc:348: pending_task_count_--; if you add a "continue;" here, you won't need an else clause on line 349, you'll have less indentation, and it will be more readable IMO. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:355: blocking_shutdown_pending_task_count_--; Lines 353-355 are critical accounting lines, and keep everything in sync. Similar accounting is done in lines 347-348. If you refactor this code, then you'll only need to do this once (and have less chance of messing up if you change your accounting). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: ++i; This was harder to read than needed, because this clause is so far away from the if(...) in line 338. IMO, it would be better to negate the condition in line 338, put these two lines immediately after line 338, and then add a "continue;" so you don't need an else clause at all. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:436: for (std::list<SequencedTask>::iterator i = pending_tasks_.begin(); I was hoping/expecting that you wouldn't need to start a thread under a lock. I also noticed that this function, as written, is a little wrong, as it will start threads at shutdown even if the only tasks that are available don't block shutdown. IMO, this iteration code should be replaced by a variant of GetWork(). You should adjust GetWork() to accept a null pointer, and then the boolean return value would tell you if there was a need for a new thread. You'd have to call it just after line 293 (when you adjusted your sets and counts). Once it was called at line 293, it would be simple to assign a result to a boolean so as to create the new thread outside the lock (after you autounlock in line 295). http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.h (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:52: // deleted rather than run, and any tasks with this mode running at This needs to be updated. I think you are leaking them now... but I didn't check all cases... you might even be deleting some, and not others. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:70: // will be deleted rather than executed. However, tasks already in progress Semantics change: you don't delete. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:83: // fail (return false) and the task will be deleted. I don't think they should be deleted if we're leaking others, especially if they might break the currently active run() of prior sequenced tasks.
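For what it's worth, here is a minimal self-contained sketch of the "change the if to a while and Wait() on the condition variable" idea from the comment on cc:260 above. The class and member names just mirror this discussion and simplify away the rest of the pool's bookkeeping:

// Sketch only: a condition variable replaces the lazily created WaitableEvent
// for the shutdown wait.
#include "base/synchronization/condition_variable.h"
#include "base/synchronization/lock.h"

class ShutdownWaiter {
 public:
  ShutdownWaiter() : cond_var_(&lock_), blocking_shutdown_task_count_(0) {}

  // Called by Shutdown(): loops until nothing that must block shutdown is
  // still pending or running, re-checking the condition under the lock on
  // every wakeup.
  void WaitForBlockingTasks() {
    base::AutoLock lock(lock_);
    while (blocking_shutdown_task_count_ > 0)
      cond_var_.Wait();
  }

  void OnBlockingTaskPosted() {
    base::AutoLock lock(lock_);
    ++blocking_shutdown_task_count_;
  }

  // Called by a worker when it finishes a task that blocks shutdown.
  void OnBlockingTaskFinished() {
    {
      base::AutoLock lock(lock_);
      --blocking_shutdown_task_count_;
    }
    cond_var_.Signal();  // Signal outside the lock, per the note above.
  }

 private:
  base::Lock lock_;
  base::ConditionVariable cond_var_;
  int blocking_shutdown_task_count_;
};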
New snap up. I'll move this to base but I didn't want to do it now or it would interfere with the diffs. I'll do that right before checking in. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; This won't work. I need the actual SimpleThread* to explicitly call Join on it in the destructor, or it asserts. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:187: cond_var_.Broadcast(); Okay, done. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:260: // There are no pending or running tasks blocking shutdown, we're done. I did this but couldn't write the loop here to keep the semantics of the testing observer, which made the result a bit more complicated than you probably envisioned. I had to tweak the termination condition as well. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: i = pending_tasks_.erase(i); This is the worker thread, but I forgot to do it outside of the lock. Because of the semantics of closures, doing this got a bit more complicated. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:355: blocking_shutdown_pending_task_count_--; It does not seem clear to me that there's an obvious refactor beyond what my new patch does. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:436: for (std::list<SequencedTask>::iterator i = pending_tasks_.begin(); I don't want to re-use GetWork. It's already complicated enough. And my additional code to properly delete tasks outside of the lock would actually make this somewhat tricky to use in this location since this function would also need to handle task deletion, and proxy that out to the calling function. Because I need to keep an accurate accounting of all threads to properly join them in the destructor, it's difficult for me to see a solution that does not create the thread in the lock. It seems any attempt to do that will result in a shutdown race that can miss the thread, which will cause a flaky assert in ~SimpleThread.
Actually, I think I figured out how to create threads outside of the lock. We need another counter (that's updated inside the lock) of threads currently being created. We also wait on this counter for the shutdown event, so we're guaranteed that all threads are created by the time we shut down.
Thnx! > I'll move this to base but I didn't want to do it now or it would interfere with > the diffs. I'll do that right before checking in.
Thread creation now happens outside of the lock. It's kind of subtle, I hope the benefit is worth the complexity.
Most of my comments focus on chasing the following pattern (which you mostly do already). I'll just give the pattern so that you can argue against it if you think it is not right. There is a general pattern that makes this class of code easy to write, and offers very few surprises. The general approach involves almost always signaling when you release the lock. Here is the more detailed formulation:
a) Do any stuff you can before you take the lock.
b) Acquire the lock, and decide what you should do, setting local variables to remind you (when you release the lock) what to do. If you have nothing to do (then other threads have nothing to do!) then go into a wait state.
c) Release the lock (sometimes via AutoUnlock... but this is your choice based on loop structure convenience).
d) Since you had work to do, you can almost routinely signal to see if others can find work to do. This also takes care of the case where you've modified the state (conditions) to get others to do stuff.
e) Do what you planned on doing (based on local flags etc.).
f) Go back to (a) (after cleaning up after what you did).
The above pattern includes shutdown:
b1) You decide you should shut down.
c1) Release the lock (often by "break"ing out of a loop, and generally not via AutoUnlock).
d1) Signal (which gets other folks to shut down).
You've moved to doing this more, but the above is super safe and simple. Holding the lock as briefly as possible (example: avoid doing big work under the lock, and only change states) ensures there is little contention from other threads looking for work to do. It also precludes almost all deadlocks. Similarly, the standard of signal-when-unlocking (other than when you go into a wait state) means you don't need to worry about checking whether other threads might find work: they can just check for themselves. http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/29002/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:107: std::vector<linked_ptr<Worker> > threads_; Hmm.... this is a confusing requirement. What is the assert concerned with?? I'll look at the code. On 2011/12/01 02:44:57, brettw wrote: > This won't work. I need the actual SimpleThread* to explicitly call Join on it > in the destructor, or it asserts. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:14: #include "base/threading/thread.h" nit: alphabetize http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:54: Inner(size_t max_threads); nit: explicit for constructor http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:126: size_t max_threads_; Is this const from construction time? http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:175: StringPrintf("BrowserWorker%d", thread_number).c_str()), Since this class appears to be designed so that we can have several distinct pools, you should probably have the prefix-name (here shown as "BrowserWorker") as a constructor parameter, which is held in Inner. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo...
content/common/sequenced_worker_pool.cc:256: sequenced.task = task; Although not a big deal, lines 252-256 should, as a matter of standard practice, be performed outside the lock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:334: base::AutoUnlock unlock(lock_); This inner loop looks excellent. I'd suggest doing a signal() just after the unlock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:345: task.task = base::Closure(); :-) Perfect. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:347: DidRunWorkerTask(this_worker, task); // Must be done inside the lock. Perhaps I'm a little confused here... was the task we just ran deleted while the lock was released (just after it was run)? If so, all we need to preserve from line 332 is the two bits of state that can be fed in here to DidRunWorkerTask() (re: was this a BLOCKING task, so we need to fix the count, and what was the sequence number that we have to pull from the set<>). http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); If you add one more scope to hold the AutoLock() on line 326, you'd have a place to put the signal() outside the lock. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:428: delete_these_outside_lock->push_back(i->task); // See above. I *think* you take care of deleting this yourself separately. In addition, if you want to go this route, you should do the clear of delete_these_outside_lock *after* you run the task around line 345... or am I confused? http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:473: cond_var_.Signal(); As mentioned above... it is simpler to always signal when you release the lock to do some work... so you won't need to signal here. You *could* be more precise about the signalling.... but it is much easier to see everything is perfect if you don't bother (and you avoid checking for "more work" in one thread, and checking for "some work to do" in the signalled thread). http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:483: DCHECK(blocking_shutdown_thread_count_ > 0); nit: DCHECK_GT http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:513: // refuse to create more threads or they would be leaked. IMO, when you do the preparation for starting a new thread, you should update (here and now) all the counts you care about. That way, you won't have to wait for the thread to really start and do the additions to the state. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:525: thread_being_created_ = true; This is a nice flag.... as it can keep us from starting two threads when one will suffice. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; This is where you should do all the state changes to account for the new addition of a thread. You should effectively account for the thread as though it is asleep (waiting).
It will soon start up, and will go through all the checking that a thread would do when it wakes up. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.h (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:52: // deleted rather than run, and any tasks with this mode running at Your comments came later about how deletion is not guaranteed... but you should update this to at least say "deleted (if time permits) rather than run.." http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:70: // will be deleted rather than executed. However, tasks already in progress again: "...deleted (if time permits) and will not be executed." http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:101: SequenceToken(int id) : id_(id) {} nit: add explicit http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:109: virtual void WillWaitForShutdown() = 0; nit: add virtual destructor. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:112: SequencedWorkerPool(size_t max_threads); nit: explicit http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.h:154: bool PostSequencedWorkerTask(SequenceToken sequence_token, I'm tempted to believe that tasks should by default be required for shutdown. I think it takes a lot of thinking to be sure you're an "optional" task. As a result, I'd like to not bother this arg list with the shutdown behavior, and have a second method like: PostNonCrucialSequencedWorkerTask(...) WDYT? Do you really want every posted task to think through this and make a choice? I'm also wary that folks will "copy" the calling pattern, and not think through the ramifications. Today, all FileThread tasks are "important," as I *think* we join the thread on completion (yes??). As a result, the default migration should be to those semantics.
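To make the (a)-(f) pattern above concrete, here is a skeleton run loop in that shape. It only illustrates the locking discipline, not the patch's actual ThreadLoop/GetWork code, and the members named in the comment are assumed:

// Assumed members of Inner: base::Lock lock_; base::ConditionVariable
// cond_var_ (constructed with &lock_); std::list<base::Closure>
// pending_tasks_; bool terminating_.
void Inner::ThreadLoop() {
  while (true) {
    base::Closure task;
    bool should_exit = false;
    {
      base::AutoLock lock(lock_);             // (b) decide under the lock
      while (pending_tasks_.empty() && !terminating_)
        cond_var_.Wait();                     // nothing for anyone to do: wait
      if (pending_tasks_.empty()) {
        should_exit = true;                   // (b1) terminating_ and no work
      } else {
        task = pending_tasks_.front();        // remember the work in locals
        pending_tasks_.pop_front();
      }
    }                                         // (c) lock released here
    cond_var_.Signal();                       // (d) routinely let others check
    if (should_exit)
      return;                                 // (d1) exit after signaling
    task.Run();                               // (e) do the work outside the lock
  }                                           // (f) back to (a)
}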
Just saw this while skimming the code. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); linked_ptr isn't threadsafe... is this code safe? Is it possible another thread copies the linked_ptr from |threads_| at the same time |new_thread| is getting destroyed in this thread? I think that would lead to concurrent modification of a thread-unsafe data structure.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); I thought the access to this ref-counted data structure was only done under the protection of a lock. You are looking at the creation of the pointer with associated structure, prior to making it available. No one else has access to it until it is pushed into the list of threads (inside the lock). The one place this is later manipulated is when the vector of threads is deleted, by calling threads_.clear(). I believe that is done outside a lock, and might be a problem (as you suspected). The vector is only kept around to be able to do the join... and I've argued that the join is not needed (we should just look for the state to indicate that all critical threads are complete). There will still be a question of how to delete the |inner| instance.... but I think that is solvable. On 2011/12/09 21:41:47, willchan wrote: > linked_ptr isn't threadsafe... is this code safe? Is it possible another thread > copies the linked_ptr from |threads_| at the same time |new_thread| is getting > destroyed in this thread? I think that would lead to concurrent modification of > a thread-unsafe data structure.
I fixed everything except the one thing noted below. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); This is already done the way you are describing. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; It's not clear what you want me to move here. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); On 2011/12/09 22:13:58, jar wrote: > I thought the access to this ref-counted data structure was only done under the > protection of a lock. You are looking at the creation of the pointer with > associated structure, prior to making it available. No one else has access to > it until it is pushed into the list of threads (inside the lock). Will is correct, this is bad. > The vector is only kept around to be able to do the join... and I've argued that > the join is not needed (we should just look for the state to indicate that all > critical threads are complete). As I explained before, the requirement that the thread be joined is enforced by the thread code in base. I didn't choose to add this. I think the existing semantics of SimpleThread are outside the scope of this review.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); AutoLock() is on line 326, which is the outermost scope of the function. As a result, this Signal() is still done while holding the lock. This is not critical to functional correctness, but at times it can cause what is termed the "box-car performance problem." The problem can involve a thread waking up due to a Signal(), only to find that it can't acquire the lock, and then going back to sleep waiting on the lock. I think Clank manifests this problem, and often does a full-blown context switch (on a single core CPU!) when a signal() is provided. That is guaranteed to land on the other thread with the lock already in use. I *think* that was the most recent horror story I heard about this. On 2011/12/10 05:42:27, brettw wrote: > This is already done the way you are describing. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:526: return threads_.size() + 1; On 2011/12/10 05:42:27, brettw wrote: > It's not clear what you want me to move here. IMO, the number of running (including to-be-created) threads should be a state variable, and that is what I think should be updated here. I apologize for having that as my mental model, which is not the current code, making my suggestion very confusing. Currently, you rely on threads_.size() to supply this count. That vector size will only be correctly updated after the new thread has started, and has updated the vector with a push(). If instead you used a state variable to keep track of the count of threads, then you can update it here (showing intent to start a new thread) and hence have it correctly set while you still hold the lock. That change would remove the concern about leaking expressed in line 512. If we don't want to leak, then we wait until the count of running threads goes to zero, and that count is not based on the vector size. http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:537: linked_ptr<Worker> new_thread(new Worker(this, thread_number)); IMO, the would-be-reusable code of SimpleThread is not as reusable as it could be, due to this requirement. The requirement should be removed (if it is really not critical to avoiding abuse), or the class should be modified to make it more reusable. We could, for instance, have a setter (or a constructor arg) that turns off this CHECK(). On 2011/12/10 05:42:27, brettw wrote: > On 2011/12/09 22:13:58, jar wrote: > > I thought the access to this ref-counted data structure was only done under > the > > protection of a lock. You are looking at the creation of the pointer with > > associated structure, prior to making it available. No one else has access > to > > it until it is pushed into the list of threads (inside the lock). > > Will is correct, this is bad. > > > The vector is only kept around to be able to do the join... and I've argued > that > > the join is not needed (we should just look for the state to indicate that all > > critical threads are complete). > > As I explained before, the requirement that the thread be joined is enforced by > the thread code in base. I didn't choose to add this. I think the existing > semantics of SimpleThread are outside the scope of this review.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); You are incorrect, the AutoLock is not at the outermost scope.
I originally had the created-threads flag be an integer count of the number of threads we're currently creating. So the integer you want would be threads_.size() + creating_threads_count_. But I think I explicitly don't want to have more than one thread in the "pending" state at once, which is why I check it on line 517. This would be different from just checking the count. Given this, I don't see how your method is any better.
http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/43001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:362: cond_var_.Signal(); You are correct. I was incorrect. Sorry... I kept looking at the curly on the first line of the function <doh>. On 2011/12/10 17:25:28, brettw wrote: > You are incorrect, the AutoLock is not at the outermost scope.
Drive-by comments, looks pretty nice http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:16: #include "base/synchronization/waitable_event.h" is this include needed? http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:440: found_task = true; Would it make sense to put the logic under PrepareToStartAdditionalThreadIfHelpful() in this method? As coded, that method will rescan the list from the beginning, whereas this method could scan forward from where the current iter is? http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:453: int SequencedWorkerPool::Inner::WillRunWorkerTask(Worker* worker, The 'worker' arg isn't used here or in the DidRun method. http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:523: if (terminating_ && i->shutdown_behavior != BLOCK_SHUTDOWN) the outer condition says if !terminating_, is this nested if needed?
With nits mentioned in the drive-by... LGTM http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... File content/common/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/49001/content/common/sequenced_wo... content/common/sequenced_worker_pool.cc:523: if (terminating_ && i->shutdown_behavior != BLOCK_SHUTDOWN) On 2011/12/14 20:37:47, michaeln wrote: > the outer condition says if !terminating_, is this nested if needed? +1 It seems like these two lines should go.
I addressed Michael's comments except for the PrepareToStartAdditionalThreadIfHelpful one. It wasn't clear to me how this would work cleanly and the current scheme isn't optimized for long queues of tasks anyway. If we start having nontrivially long queues, we should refactor in more ways than this.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/brettw@chromium.org/8416019/96003
Try job failure for 8416019-96003 (retry) (retry) on win_rel for steps "base_unittests, safe_browsing_tests" (clobber build). It's a second try, previously, step "compile" failed. http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
The tests were flaky and I spent a really really long time figuring it out. There were several problems which are fixed in the latest patch (I'm not going to wait for another LGTM; if you have comments I'll be happy to fix them in a followup):
- (This is the only real bug.) There was a race between creating a thread and clearing the "I'm creating the thread" flag. This prevented us from creating threads when we could have, and sometimes deadlocked the test (which was waiting for more threads to be created).
- I also added a long comment explaining why we must check whether we need more threads before we actually run the task. My previous patches did this correctly, but it was accidental. I realized this is critical and somewhat subtle so I documented it better.
- There is an inherent race condition in thread creation with shutdown. I had to add an extra phase to the tests to ensure that all the workers are created before we start doing stuff. Otherwise, if shutdown was called, no more threads would be created. I think that this behavior is OK in real life since it's only a problem if you can post lots of tasks and then shut down faster than you can create threads. I think having this race is better than the alternative of blocking on thread creation in some crazy cases, or having more than one thread creation pending at a time (which causes other challenges). I documented this behavior well.
- My unit test was a disaster because I incorrectly assumed that if you signal an event X times, we would allow X threads to go through the event. I forgot that those X threads actually need to be waiting on the event! So I added a condition variable that we wait on to ensure that all the workers are actually waiting before they're woken up. This could be done slightly more cleanly, I think, by replacing the current event with another (or the same) condition variable. But after all the time I've spent getting these tests stable I'm loath to mess with it, and it's pretty short and not bad now anyway.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/brettw@chromium.org/8416019/102001
I just saw your comment about having trouble with the unit tests. My apologies. I did a very intense review of the working code, and was not very careful about your test code. Here are comments on the test code, which may explain why you are having trouble (unless I'm confused about how WaitableEvent() is supposed to work). http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool_unittest.cc (right): http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:47: // Note that this task has started and signal anwaybody waiting for that nit: anwaybody-->anybody http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:54: With Murphy threads, all the tasks *could* pause at line 54, and not reach line 55 to start the Wait(). As a result, a series of signals to the event could all be aggregated, as events don't (I think) sum up. This is the fundamental difference between events and condition variables. Cond vars can atomically release the lock, *and* start the wait, so it is impossible to pause between the two points. http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:153: event.Signal(); As noted above (see line 54), this won't consistently work, and will induce flaky bustage. The workers can, after notifying us of their task completion, block before the event wait(). As a result, a multitude of events.Signal() calls can be aggregated, and the desired number of awakenings may not appear. In addition, even if all the threads were actually blocked at event.Wait(), we might fire off all these event.Signal() calls before any awakening had a chance to reset the event... and hence again, signals may be lost (they aggregate into a bool). http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:188: event->Signal(); I don't think event contains a counter, and hence I don't think sending a multitude of Signal()s is guaranteed to awaken the desired number of Wait()ers.
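In case it helps, here is a self-contained sketch of the kind of test helper that avoids the lost-signal problem: rather than relying on one WaitableEvent::Signal() per worker, it counts how many workers are actually parked and hands out explicit unblock tokens under a lock. This mirrors the idea described above; it is not the unittest's actual code:

#include "base/synchronization/condition_variable.h"
#include "base/synchronization/lock.h"

class WorkerGate {
 public:
  WorkerGate() : cond_var_(&lock_), waiting_count_(0), unblock_count_(0) {}

  // Called by each worker task: registers as waiting, then blocks until the
  // test hands out an unblock token.  Wait() releases and re-acquires the
  // lock atomically, so a token can never slip past between the two steps.
  void Block() {
    base::AutoLock lock(lock_);
    ++waiting_count_;
    cond_var_.Broadcast();  // Wake the test thread watching waiting_count_.
    while (unblock_count_ == 0)
      cond_var_.Wait();
    --unblock_count_;
    --waiting_count_;
  }

  // Called by the test: waits until |count| workers are parked in Block(),
  // then lets exactly |count| of them continue.
  void UnblockWhenWaiting(int count) {
    base::AutoLock lock(lock_);
    while (waiting_count_ < count)
      cond_var_.Wait();
    unblock_count_ += count;
    cond_var_.Broadcast();
  }

 private:
  base::Lock lock_;
  base::ConditionVariable cond_var_;
  int waiting_count_;
  int unblock_count_;
};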
http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool_unittest.cc (right): http://codereview.chromium.org/8416019/diff/102001/base/threading/sequenced_w... base/threading/sequenced_worker_pool_unittest.cc:54: Ah, good catch. I'll fix this.
New snap up using cond vars instead of events. I think this is a bit cleaner and should remove the races.
lgtm
I found another real bug that causes test flakiness due to a race condition on thread creation. I kept a count of the number of running threads. This was incremented before I ran a task, and decremented after it was complete. The thread creation code used this to know if there were free threads. The problem is that if a thread was being created, it would not count as "running" until it actually picked up a task. This means that we'd think there was a free thread since the number of threads was greater than the number of running threads, and not create more. I inverted this condition to count "waiting" threads. Both counts are wrong sometimes, but now we err on the side of thinking the thread is busy rather than free, which allows us to create additional threads in this case.
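A rough sketch of the inverted check, with the caveat that the real helper also has to factor in shutdown state and sequence-token blocking; the parameter names here just mirror the discussion:

#include <stddef.h>

// Sketch only.  A thread that exists but has not yet reached its wait loop is
// simply "not waiting", so it counts as busy and no longer hides the need for
// another worker.
bool ShouldStartAdditionalThread(size_t thread_count,
                                 size_t max_threads,
                                 size_t waiting_thread_count,
                                 bool thread_being_created,
                                 size_t runnable_pending_tasks) {
  return !thread_being_created &&
         thread_count < max_threads &&
         runnable_pending_tasks > waiting_thread_count;
}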
See comments below. If you're not worried, and equally tired, then LGTM. If you encounter flaky tests, due to the race, then you may need to consider changes as mentioned. http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:367: waiting_thread_count_--; This is much nicer, as it is easy to read in one place and see correctness. http://codereview.chromium.org/8416019/diff/62010/base/threading/sequenced_wo... base/threading/sequenced_worker_pool.cc:538: threads_.size() < max_threads_ && Note that you still depend on threads_.size() here... so racy pushing (changing size during the brief holding and release of a lock) can still lurk. You could get rid of the vector if you didn't use join, and instead used the condition variable and the lock. Then you'd have variables that were easier to change (increment and decrement) under the lock. I'm as tired as you are of this CL.... so if the race is not critical, then don't worry about this.
You're right! I moved this insertion into the list to the beginning of the actual worker thread. This way, the "inserting into the list" and "clearing the creation flag" happen at the same time, which I think will solve the problem.
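Roughly, the fix has this shape (the actual diff line is quoted in the next message; members here are simplified):

// Sketch: the first thing the new worker does inside the pool's thread loop
// is register itself and clear the creation flag in one critical section, so
// the two pieces of state can never be observed out of sync.
void Inner::ThreadLoop(Worker* this_worker) {
  {
    base::AutoLock lock(lock_);
    thread_being_created_ = false;
    threads_.push_back(linked_ptr<Worker>(this_worker));
  }
  // ...the normal wait / get-work / run-task loop continues from here...
}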
lgtm http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... File base/threading/sequenced_worker_pool.cc (right): http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... base/threading/sequenced_worker_pool.cc:336: threads_.push_back(linked_ptr<Worker>(this_worker)); This is nicer and cleaner too. I like that it is right next to the clearing of thread_being_created_. http://codereview.chromium.org/8416019/diff/117001/base/threading/sequenced_w... base/threading/sequenced_worker_pool.cc:540: threads_.size() < max_threads_ && Thinking about it more, I remembered that you're protected from this race by the boolean thread_being_created_. While that is set, we never test threads_.size(), which probably avoids the race.