|
|
Created:
10 years, 10 months ago by Scott Hess - ex-Googler Modified:
9 years, 6 months ago CC:
chromium-reviews_googlegroups.com Visibility:
Public. |
DescriptionConvert SafeBrowsingStoreFile to do bulk reads and writes.
Read/write the data in the style of fread/fwrite, rather than doing
I/O element by element. This lays the groundwork for adding
checksumming.
BUG=none
TEST=none
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=39619
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=40186
Patch Set 1 #
Total comments: 9
Patch Set 2 : Drop functor, dedicated read/write ops for sets. #Patch Set 3 : Typo. #Patch Set 4 : Fix for gcc 4.4 #Patch Set 5 : Additional comment on the tweak #
Total comments: 2
Patch Set 6 : Default-initialize POD contents. #
Messages
Total messages: 18 (0 generated)
OK, now we're getting somewhere. This if the first half of adding checksums to the file format. Broke the CL in half because the evicted and introduced code made it hard to follow that the before and after were equivalent. So this just sets up the chokepoints.
http://codereview.chromium.org/650113/diff/1/3 File chrome/browser/safe_browsing/safe_browsing_store_file.cc (right): http://codereview.chromium.org/650113/diff/1/3#newcode98 chrome/browser/safe_browsing/safe_browsing_store_file.cc:98: class DeletedChunkRemover { name is a bit confusing. Maybe DeletedChunkExists or DeletedChunkTest? http://codereview.chromium.org/650113/diff/1/3#newcode120 chrome/browser/safe_browsing/safe_browsing_store_file.cc:120: std::remove_if(old_end, vec->end(), DeletedChunkRemover<T>(del_set)); I was going to comment that this seemed like it didn't have the nice perf characteristics of your old code, but then I thought better and read the docs for this (poorly named) method. It's pretty cool, but I would have never guessed what it did from its name. http://codereview.chromium.org/650113/diff/1/3#newcode394 chrome/browser/safe_browsing/safe_browsing_store_file.cc:394: std::vector<int32> add_chunks_flat(add_chunks_cache_.begin(), These sets can be pretty large, right? It's a little bit of a bummer that we're doubling their memory use here, but I guess it's not really avoidable. At a minimum, you could create and write these vectors one at a time so we're not paying the extra memory use for both of them at the same time.
LGTM
http://codereview.chromium.org/650113/diff/1/3 File chrome/browser/safe_browsing/safe_browsing_store_file.cc (right): http://codereview.chromium.org/650113/diff/1/3#newcode98 chrome/browser/safe_browsing/safe_browsing_store_file.cc:98: class DeletedChunkRemover { On 2010/02/22 17:50:58, Erik Kay wrote: > name is a bit confusing. Maybe DeletedChunkExists or DeletedChunkTest? Now that you mention it - that _is_ confusing, since it's not the active remover thing. DeletedChunkTester seems more reasonable. http://codereview.chromium.org/650113/diff/1/3#newcode120 chrome/browser/safe_browsing/safe_browsing_store_file.cc:120: std::remove_if(old_end, vec->end(), DeletedChunkRemover<T>(del_set)); On 2010/02/22 17:50:58, Erik Kay wrote: > I was going to comment that this seemed like it didn't have the nice perf > characteristics of your old code, but then I thought better and read the docs > for this (poorly named) method. It's pretty cool, but I would have never > guessed what it did from its name. The remove_if() is good for what it does, writing it longhand would be fine, too. Thinking back, I believe that using the functor is an artifact of how I ended up with this design - now that it's only used in these two places, it's not as compelling. Requiring the reader to look something up is poor, I think. Unfortunately, we can either read in bulk then delete, or read record-by-record and delete before inserting, so at core this probably is a bit less efficient in memory usage. http://codereview.chromium.org/650113/diff/1/3#newcode361 chrome/browser/safe_browsing/safe_browsing_store_file.cc:361: DeletedChunkRemover<SBAddFullHash>(add_del_cache_)); This usage is also kind of degenerate. Once I had a use for DeletedChunkRemover<>(), having multiple styles of testing was poor. http://codereview.chromium.org/650113/diff/1/3#newcode394 chrome/browser/safe_browsing/safe_browsing_store_file.cc:394: std::vector<int32> add_chunks_flat(add_chunks_cache_.begin(), On 2010/02/22 17:50:58, Erik Kay wrote: > These sets can be pretty large, right? It's a little bit of a bummer that we're > doubling their memory use here, but I guess it's not really avoidable. At a > minimum, you could create and write these vectors one at a time so we're not > paying the extra memory use for both of them at the same time. I could recall the distinct writer specific to sets, but I don't think these are relatively all that big. From a 5M database, they're maybe 25k records (100k of data)? Having all callers use the same readers and writers is useful for the checksum change. Hmm. It would be easy to toss this into a wrapper function to get the half-at-a-time without losing the all-through-one-chokepoint. OK.
Sorry to waste a perfectly good LGTM, but ... could you check out these two changes? If you think leave-well-enough-alone, I can draw them back out. Thanks. http://codereview.chromium.org/650113/diff/1/3 File chrome/browser/safe_browsing/safe_browsing_store_file.cc (right): http://codereview.chromium.org/650113/diff/1/3#newcode120 chrome/browser/safe_browsing/safe_browsing_store_file.cc:120: std::remove_if(old_end, vec->end(), DeletedChunkRemover<T>(del_set)); On 2010/02/22 18:25:56, shess wrote: > On 2010/02/22 17:50:58, Erik Kay wrote: > > I was going to comment that this seemed like it didn't have the nice perf > > characteristics of your old code, but then I thought better and read the docs > > for this (poorly named) method. It's pretty cool, but I would have never > > guessed what it did from its name. > > The remove_if() is good for what it does, writing it longhand would be fine, > too. Thinking back, I believe that using the functor is an artifact of how I > ended up with this design - now that it's only used in these two places, it's > not as compelling. Requiring the reader to look something up is poor, I think. > > Unfortunately, we can either read in bulk then delete, or read record-by-record > and delete before inserting, so at core this probably is a bit less efficient in > memory usage. I've expanded away the uses of std::remove_if() and std::remove_copy_if(), once those were the only cases I think they were more confusing than warranted. http://codereview.chromium.org/650113/diff/1/3#newcode394 chrome/browser/safe_browsing/safe_browsing_store_file.cc:394: std::vector<int32> add_chunks_flat(add_chunks_cache_.begin(), On 2010/02/22 18:25:56, shess wrote: > Hmm. It would be easy to toss this into a wrapper function to get the > half-at-a-time without losing the all-through-one-chokepoint. OK. This _was_ easy, and arguably cleaner-looking.
LGTM
+piman because of the ARM change. ARM builder failed with: chrome/browser/safe_browsing/safe_browsing_store_file.cc: In function 'bool<unnamed>::ReadToVectorAndDelete(std::vector<T, std::allocator<_Tp1> >*, size_t, FILE*, const __gnu_cxx::hash_set<int, __gnu_cxx::hash<int>, std::equal_to<int>, std::allocator<int> >&) [with T = SBSubFullHash]': chrome/browser/safe_browsing/safe_browsing_store_file.cc:69: error: '<anonymous>' may be used uninitialized in this function chrome/browser/safe_browsing/safe_browsing_store_file.cc:69: note: '<anonymous>' was declared here chrome/browser/safe_browsing/safe_browsing_store_file.cc:75: error: '<anonymous>' may be used uninitialized in this function chrome/browser/safe_browsing/safe_browsing_store_file.cc:75: note: '<anonymous>' was declared here There were a couple ways to fix it, all involving rearranging templates WRT each other. This one seemed the cleanest. Since the alternate code is itself valid (and is always selected), I was considering just changing to that, but I don't think it's the obvious call to use, here.
On 2010/02/23 21:47:43, shess wrote: > ARM builder failed with: <crap> +1 for websites freely reformatting literal text. Anyhow, Erik LGTM'ed through patch set 3, inclusive, 4 is the operative patch, 5 is a comment tweak.
http://codereview.chromium.org/650113/diff/1011/1012 File chrome/browser/safe_browsing/safe_browsing_store.h (right): http://codereview.chromium.org/650113/diff/1011/1012#newcode46 chrome/browser/safe_browsing/safe_browsing_store.h:46: SBAddPrefix() {} mmh, I wonder if the gcc issue you're seeing is that chunk_id isn't initialized in the default constructor here (and similarly below).
http://codereview.chromium.org/650113/diff/1011/1012 File chrome/browser/safe_browsing/safe_browsing_store.h (right): http://codereview.chromium.org/650113/diff/1011/1012#newcode46 chrome/browser/safe_browsing/safe_browsing_store.h:46: SBAddPrefix() {} On 2010/02/23 22:02:53, piman wrote: > mmh, I wonder if the gcc issue you're seeing is that chunk_id isn't initialized > in the default constructor here (and similarly below). Adding: SBAddPrefix() : chunk_id(0) {} (and likewise to others) _does_ fix the compile error. But the only way I can make sense of that is if it's connecting the default constructor with RemoveDeleted<>(), which then makes me worry that the compiled code will be incorrect. OTOH, since the alternate code uses the same default constructor, it could be unrelated and any perturbation of the template-expansion will "fix" things by changing the inlining decisions. Either approach works for me, I guess.
On Tue, Feb 23, 2010 at 2:22 PM, <shess@chromium.org> wrote: > > http://codereview.chromium.org/650113/diff/1011/1012 > File chrome/browser/safe_browsing/safe_browsing_store.h (right): > > http://codereview.chromium.org/650113/diff/1011/1012#newcode46 > chrome/browser/safe_browsing/safe_browsing_store.h:46: SBAddPrefix() {} > On 2010/02/23 22:02:53, piman wrote: > >> mmh, I wonder if the gcc issue you're seeing is that chunk_id isn't >> > initialized > >> in the default constructor here (and similarly below). >> > > Adding: > SBAddPrefix() : chunk_id(0) {} > (and likewise to others) _does_ fix the compile error. But the only way > I can make sense of that is if it's connecting the default constructor > with RemoveDeleted<>(), which then makes me worry that the compiled code > will be incorrect. OTOH, since the alternate code uses the same default > constructor, it could be unrelated and any perturbation of the > template-expansion will "fix" things by changing the inlining decisions. > > Either approach works for me, I guess. Different inlining may or may not let the compiler detect uninitialized reads - std::vector::resize likely copies a default object into its final position when growing (meaning it will read from the newly constructed object). In general we frown upon uninitialized data (constant source of bugs), so I strongly suggest fixing that. Antoine > > > http://codereview.chromium.org/650113 >
On 2010/02/23 22:31:52, piman wrote: > Different inlining may or may not let the compiler detect uninitialized > reads - std::vector::resize likely copies a default object into its final > position when growing (meaning it will read from the newly constructed > object). > In general we frown upon uninitialized data (constant source of bugs), so I > strongly suggest fixing that. I'm concerned about code-generation because there are no uninitialized reads. The data is uninitialized after the resize(), if ReadArray() succeeds the data has been initialized (from the file), if it fails the second resize() should remove it. So if the code is detecting an uninitialized read, that implies that it can detect an initialized read, and that chunk_id is initialized with 0, thus the count() on the const set in RemoveDeleted() will always be the same and the loop can be reduced appropriately.
On 2010/02/23 22:39:53, shess wrote: > On 2010/02/23 22:31:52, piman wrote: > > Different inlining may or may not let the compiler detect uninitialized > > reads - std::vector::resize likely copies a default object into its final > > position when growing (meaning it will read from the newly constructed > > object). > > In general we frown upon uninitialized data (constant source of bugs), so I > > strongly suggest fixing that. > > I'm concerned about code-generation because there are no uninitialized reads. > The data is uninitialized after the resize(), if ReadArray() succeeds the data > has been initialized (from the file), if it fails the second resize() should > remove it. So if the code is detecting an uninitialized read, that implies that > it can detect an initialized read, and that chunk_id is initialized with 0, thus > the count() on the const set in RemoveDeleted() will always be the same and the > loop can be reduced appropriately. Beyond all that, though, chunk_id is uninitialized in the replacement code, too :-). And ... adding empty destructors to the objects also allows the original code to work.
On 2010/02/23 22:42:27, shess wrote: > On 2010/02/23 22:39:53, shess wrote: > > On 2010/02/23 22:31:52, piman wrote: > > > Different inlining may or may not let the compiler detect uninitialized > > > reads - std::vector::resize likely copies a default object into its final > > > position when growing (meaning it will read from the newly constructed > > > object). > > > In general we frown upon uninitialized data (constant source of bugs), so I > > > strongly suggest fixing that. > > > > I'm concerned about code-generation because there are no uninitialized reads. > > The data is uninitialized after the resize(), if ReadArray() succeeds the data > > has been initialized (from the file), if it fails the second resize() should > > remove it. So if the code is detecting an uninitialized read, that implies > that > > it can detect an initialized read, and that chunk_id is initialized with 0, > thus > > the count() on the const set in RemoveDeleted() will always be the same and > the > > loop can be reduced appropriately. > > Beyond all that, though, chunk_id is uninitialized in the replacement code, too > :-). > > And ... adding empty destructors to the objects also allows the original code to > work. OK, so in the interests of moving this forward... the code in question wants to enlarge the vector, then fill the data directly from a blob. This is a degenerate thing to do, but it is also intentional. Things that seem to fix the error: - Use explicit insert() and erase() as alternatives to resize(). - Default constructors initialize chunk_id. - Provide default destructors. - Provide a copy constructor. - Remove all constructors entirely. Initializing chunk_id seems right, but insofar as the compiler can see that it's used uninitialized and is making a legitimate complaint, I'm not entirely clear that it wouldn't optimize that use WRT the default constructor, which would be incorrect. I kind of like the last option of removing all constructors. I originally added constructors as a convenience for creating new elements.
On Wed, Feb 24, 2010 at 4:53 PM, <shess@chromium.org> wrote: > On 2010/02/23 22:42:27, shess wrote: > >> On 2010/02/23 22:39:53, shess wrote: >> > On 2010/02/23 22:31:52, piman wrote: >> > > Different inlining may or may not let the compiler detect >> uninitialized >> > > reads - std::vector::resize likely copies a default object into its >> final >> > > position when growing (meaning it will read from the newly constructed >> > > object). >> > > In general we frown upon uninitialized data (constant source of bugs), >> so >> > I > >> > > strongly suggest fixing that. >> > >> > I'm concerned about code-generation because there are no uninitialized >> > reads. > >> > The data is uninitialized after the resize(), if ReadArray() succeeds >> the >> > data > >> > has been initialized (from the file), if it fails the second resize() >> should >> > remove it. So if the code is detecting an uninitialized read, that >> implies >> that >> > it can detect an initialized read, and that chunk_id is initialized with >> 0, >> thus >> > the count() on the const set in RemoveDeleted() will always be the same >> and >> the >> > loop can be reduced appropriately. >> > > Beyond all that, though, chunk_id is uninitialized in the replacement >> code, >> > too > >> :-). >> > > And ... adding empty destructors to the objects also allows the original >> code >> > to > >> work. >> > > OK, so in the interests of moving this forward... the code in question > wants to > enlarge the vector, then fill the data directly from a blob. This is a > degenerate thing to do, but it is also intentional. Things that seem to > fix the > error: > > - Use explicit insert() and erase() as alternatives to resize(). > This seems wrong, I don't buy the GCC bug, we have uninitialized stuff that we most likely shouldn't. Switching to insert/erase only mask the problem. > - Default constructors initialize chunk_id. > (and other pod fields) > - Provide default destructors. > - Provide a copy constructor. > Those two probably mask the problem because it makes the struct non-POD and the compiler consider it differently > - Remove all constructors entirely. > I'm not sure why that fixes it... > > Initializing chunk_id seems right, but insofar as the compiler can see that > it's > used uninitialized and is making a legitimate complaint, I'm not entirely > clear > that it wouldn't optimize that use WRT the default constructor, which would > be > incorrect. > What's the concern about initializing the data ? You're doing I/O, that'll be orders of magnitude slower than writing a bunch of 0s in memory. > > I kind of like the last option of removing all constructors. I originally > added > constructors as a convenience for creating new elements. > > http://codereview.chromium.org/650113 >
Sorry for the delay. Sheriffing. On Wed, Feb 24, 2010 at 6:08 PM, Antoine Labour <piman@chromium.org> wrote: > On Wed, Feb 24, 2010 at 4:53 PM, <shess@chromium.org> wrote: >> Initializing chunk_id seems right, but insofar as the compiler can see >> that it's >> used uninitialized and is making a legitimate complaint, I'm not entirely >> clear >> that it wouldn't optimize that use WRT the default constructor, which >> would be >> incorrect. > > What's the concern about initializing the data ? You're doing I/O, that'll > be orders of magnitude slower than writing a bunch of 0s in memory. As I mentioned before, my concern wasn't with initializing the data, it was with the fact that the compiler could connect the dots and didn't see that the data was being overwritten, so I was concerned that inappropriate optimizations would be made. Later I realized that if that were happening, it could happen in any of these cases, so the unit tests will just have to stand guard against this. After reading around about value initializers, I've modified things to do value-initialization on each of the POD's contents. WDYT? > http://codereview.chromium.org/650113
LGTM, thanks. I think it's better that way.
On 2010/02/26 23:46:46, piman wrote: > LGTM, thanks. I think it's better that way. Thank you for the patience. I learned a lot. Probably not stuff I'll ever be able to productively use, but I'll sure wow them at the next cocktail party I attend. |