|
|
Created:
7 years, 1 month ago by rvargas (doing something else) Modified:
6 years, 11 months ago Reviewers:
Randy Smith (Not in Mondays) CC:
chromium-reviews, cbentzel+watch_chromium.org, gavinp+disk_chromium.org Base URL:
svn://svn.chromium.org/chrome/trunk/src/ Visibility:
Public. |
DescriptionDisk cache v3: The main index table.
The IndexTable controls what entries are stored in the cache.
This class provides a memory-only view of the underlying structures
that store "the index"
BUG=241277
TEST=net_unittests
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=244027
Patch Set 1 : #
Total comments: 14
Patch Set 2 : #
Total comments: 12
Patch Set 3 : add comments #Patch Set 4 : #
Total comments: 20
Patch Set 5 : Unify UpdateIterator #
Total comments: 9
Patch Set 6 : WalkTables #
Total comments: 25
Patch Set 7 : More comments #Patch Set 8 : remove From*Address use #
Total comments: 24
Patch Set 9 : #
Total comments: 33
Patch Set 10 : for(;;) & Co #
Total comments: 5
Patch Set 11 : #
Total comments: 14
Patch Set 12 : rename address and hash #
Total comments: 10
Patch Set 13 : #Patch Set 14 : Rebase #Patch Set 15 : #
Messages
Total messages: 58 (0 generated)
Ricardo: I'm getting an error accessing this via the "view" links; when I click through I get Error fetching svn://svn.chromium.org/chrome/trunk/src/net/disk_cache/addr.h?rev=231691: InvalidURLError: Invalid request URL: svn://svn.chromium.org/chrome/trunk/src/net/disk_cache/addr.h?rev=231691 Any idea what's going on, or should I escalate via crbug?
On 2013/10/31 17:53:20, rdsmith wrote: > Ricardo: I'm getting an error accessing this via the "view" links; when I click > through I get > > Error fetching > svn://svn.chromium.org/chrome/trunk/src/net/disk_cache/addr.h?rev=231691: > InvalidURLError: Invalid request URL: > svn://svn.chromium.org/chrome/trunk/src/net/disk_cache/addr.h?rev=231691 > > Any idea what's going on, or should I escalate via crbug? No idea, but I just uploaded again and now it shows up via both links. I deleted the broken patch set.
Just a quick heads up that I'm going to be leaving soon, and won't be in tomorrow, so I probably won't get initial comments back to you until Monday. But this is at the top of my queue.
Quick question from doing an initial scan through: It's hard to evaluate an interface without seeing how that interface is going to be used in real code. Is https://codereview.chromium.org/15203004 (diffed against version of trunk before this started landing) or https://codereview.chromium.org/17507006 (diff against current trunk) still authoritative if I want to look at that?
On 2013/10/31 21:24:09, rdsmith wrote: > Quick question from doing an initial scan through: It's hard to evaluate an > interface without seeing how that interface is going to be used in real code. > Is https://codereview.chromium.org/15203004 (diffed against version of trunk > before this started landing) or https://codereview.chromium.org/17507006 (diff > against current trunk) still authoritative if I want to look at that? 15203004 is pretty close... it lacks some cleanup changes I made for this review, but I'll update it to match this code (I'm assuming that you just want to see the code, not to patch it locally). The use is basically: BackendImplV3 owns IndexTable, and initialization is made by: int BackendImplV3::Init(const CompletionCallback& callback) { scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_INIT); PostWorkItem(work_item); ---> post a task to the cache thread (BackendWorker) to open and map the files. } void BackendImplV3::OnInitComplete(WorkItem* work_item) { // called when the BackendWorker completes initialization... so that IndexTableInitData is available on the IO thread index_.Init(&result.get()->index_data); } When the index needs to grow: void BackendImplV3::GrowIndex() { scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_GROW_INDEX); PostWorkItem(work_item); ----> same thing, ask the worker to extend the files and map them again } void BackendImplV3::OnGrowIndexComplete(WorkItem* work_item) { index_.Init(&result.get()->index_data); ----> replace the old mappings with the new one work_item->set_flags(WorkItem::WORK_COMPLETE); PostWorkItem(work_item); ----> post a task to release the old mappings } To create an entry: int BackendImplV3::CreateEntry(const std::string& key, Entry** entry, const CompletionCallback& callback) { EntrySet entries = index_.LookupEntry(hash); + post a task to open the entry if a match is found } int BackendImplV3::OnCreateEntryComplete(const std::string& key, uint32 hash, ShortEntryRecord* short_record, Entry** entry, const CompletionCallback& callback) { EntryCell cell = index_.CreateEntryCell(hash, entry_address); if (!cell.IsValid()) { } + use the cell. } enumerations: (around BackendImplV3::OpenNext() ) int lower_limit = index_.CalculateTimestamp(work_item->initial_time()); if (iterator->timestamp <= lower_limit || !index_.GetNextCells(iterator)) { return false; } ... while (!cells->empty()) { EntryCell last_cell = index_.FindEntryCell(cells->back().hash, cells->back().address); cells->pop_back(); if (!last_cell.IsValid()) continue; Basically the IndexTable returns a group of entries that match a given timestamp (that has a granularity of 1 minute)... keeps the list around and inspect each element to iterate or do evictions. The process implies sending tasks to the worker thread and when that completes verifying that the cell is ok, by calling FindEntryCell The other user of the IndexTable API is eviction_v3, which grabs the table from the backend and uses it to select entries to evict... but then it just tells the backend to evict a given entry. So basically, all of EvictionV3 + BackendImplV3 + this CL run on the IO thread, and all thread hoping is done by the backend. Let me know if you are interested in the thread logic (BackendWorker and BackendWorkItem)... I assume it's not that relevant at this point. (that is to say, I'm happy to answer questions, the reference CL is not commented so it may be hard to follow).
> Basically the IndexTable returns a group of entries that match a given timestamp > (that has a granularity of 1 minute)... keeps the list around and inspect each That should say "the caller keeps the list" > element to iterate or do evictions. The process implies sending tasks to the > worker thread and when that completes verifying that the cell is ok, by calling > FindEntryCell
On 2013/10/31 22:18:57, rvargas wrote: > On 2013/10/31 21:24:09, rdsmith wrote: > > Quick question from doing an initial scan through: It's hard to evaluate an > > interface without seeing how that interface is going to be used in real code. > > Is https://codereview.chromium.org/15203004 (diffed against version of trunk > > before this started landing) or https://codereview.chromium.org/17507006 (diff > > against current trunk) still authoritative if I want to look at that? > > 15203004 is pretty close... it lacks some cleanup changes I made for this > review, but I'll update it to match this code (I'm assuming that you just want > to see the code, not to patch it locally). It's easiest for me to navigate code in my usual development environment, so if you can make it locally patchable, that'd be useful. If it's a hassle, don't worry about it. (Note that it doesn't need to be at ToT, identical to this CL, or compilable to work well for me--if I can sync to a particular revision, patch, and have the patch succeed, I'm happy). Thanks for the summary below; I'll work on reviewing the CL based on it. > > The use is basically: > BackendImplV3 owns IndexTable, and initialization is made by: > > int BackendImplV3::Init(const CompletionCallback& callback) { > scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_INIT); > PostWorkItem(work_item); ---> post a task to the cache thread (BackendWorker) > to open and map the files. > } > > void BackendImplV3::OnInitComplete(WorkItem* work_item) { > // called when the BackendWorker completes initialization... so that > IndexTableInitData is available on the IO thread > index_.Init(&result.get()->index_data); > } > > When the index needs to grow: > void BackendImplV3::GrowIndex() { > scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_GROW_INDEX); > PostWorkItem(work_item); ----> same thing, ask the worker to extend the > files and map them again > } > > void BackendImplV3::OnGrowIndexComplete(WorkItem* work_item) { > index_.Init(&result.get()->index_data); ----> replace the old mappings with > the new one > work_item->set_flags(WorkItem::WORK_COMPLETE); > PostWorkItem(work_item); ----> post a task to release the old mappings > } > > To create an entry: > int BackendImplV3::CreateEntry(const std::string& key, Entry** entry, > const CompletionCallback& callback) { > EntrySet entries = index_.LookupEntry(hash); > + post a task to open the entry if a match is found > } > > int BackendImplV3::OnCreateEntryComplete(const std::string& key, uint32 hash, > ShortEntryRecord* short_record, > Entry** entry, > const CompletionCallback& callback) { > EntryCell cell = index_.CreateEntryCell(hash, entry_address); > if (!cell.IsValid()) { > } > > + use the cell. > } > > enumerations: > (around BackendImplV3::OpenNext() ) > > int lower_limit = index_.CalculateTimestamp(work_item->initial_time()); > if (iterator->timestamp <= lower_limit || > !index_.GetNextCells(iterator)) { > return false; > } > > ... > while (!cells->empty()) { > EntryCell last_cell = index_.FindEntryCell(cells->back().hash, > cells->back().address); > cells->pop_back(); > if (!last_cell.IsValid()) > continue; > > Basically the IndexTable returns a group of entries that match a given timestamp > (that has a granularity of 1 minute)... keeps the list around and inspect each > element to iterate or do evictions. The process implies sending tasks to the > worker thread and when that completes verifying that the cell is ok, by calling > FindEntryCell > > The other user of the IndexTable API is eviction_v3, which grabs the table from > the backend and uses it to select entries to evict... but then it just tells the > backend to evict a given entry. So basically, all of EvictionV3 + BackendImplV3 > + this CL run on the IO thread, and all thread hoping is done by the backend. > > Let me know if you are interested in the thread logic (BackendWorker and > BackendWorkItem)... I assume it's not that relevant at this point. (that is to > say, I'm happy to answer questions, the reference CL is not commented so it may > be hard to follow).
> It's easiest for me to navigate code in my usual development environment, so if > you can make it locally patchable, that'd be useful. If it's a hassle, don't > worry about it. (Note that it doesn't need to be at ToT, identical to this CL, > or compilable to work well for me--if I can sync to a particular revision, > patch, and have the patch succeed, I'm happy). 15203004 is synced against rev 199883. I'll work on a tot patch.
On 2013/11/01 22:26:06, rvargas wrote: > > It's easiest for me to navigate code in my usual development environment, so > if > > you can make it locally patchable, that'd be useful. If it's a hassle, don't > > worry about it. (Note that it doesn't need to be at ToT, identical to this > CL, > > or compilable to work well for me--if I can sync to a particular revision, > > patch, and have the patch succeed, I'm happy). > > 15203004 is synced against rev 199883. I'll work on a tot patch. And now 17507006 is also up to day, to ToT.
Thanks! I'll try to get a first round back today.
On 2013/11/04 16:04:13, rdsmith wrote: > Thanks! I'll try to get a first round back today. Apologies; high priority review request for M32 + interview today -> I'll be getting back to this review tomorrow.
Quick question below (not blocking review, but I suspect you'll give me the answer faster than I'll figure it out from the code :-}). https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:117: // combination of hash and address. Sorry, another question. I'm not seeing how the hash is necessary to uniquely identify an index table entity. Under what circumstances will an entity share the same address as another entity?
https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:117: // combination of hash and address. On 2013/11/05 19:52:26, rdsmith wrote: > Sorry, another question. I'm not seeing how the hash is necessary to uniquely > identify an index table entity. Under what circumstances will an entity share > the same address as another entity? Never :). The thing is that locating an address without the hash means going over every entry looking for a match. The hash allows fast lookup. I'll clarify the comment.
Still working on wrapping my brain around the CL. One questions; What is the intention for the difference in usage between EntryCell and IndexCell? I understand that IndexCell is a struct (so, among other things, it can't really have methods without a lot of contorting) but I'm wondering what gets in the way of making EntryCell the way that the IndexCell information is used anywhere where we're not actually trying to write an IndexCell in a bucket? My goal here is bundling all of the functions at the beginning of index_table.cc that manipulate IndexCells into member functions on EntryCell and more rarely using IndexCells. (This question is being asked more to clarify my thinking around IndexCell/EntryCell than it is a serious suggestion; I figure I'll learn something important from the answer.) https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:362: memcpy(destination, &cell_, sizeof(cell_)); nit: Why not assignment? It would seem simpler. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:385: if (main_table_) { What's the difference between this case and |growing == true|? The bits comparison makes it look like we're confirming that we've just bumped up the number of bits in the table by one, i.e. that we're growing. And this function is the only place we set either main_table_ or header_. So it would seem that this is only different from |growing == true| if params was first passed with a null main_table_. When can that happen/what does it mean? https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:390: old_extra_table.reset(new IndexBucket[extra_size]); nit: double whitespace. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:800: int time = GetCellTimestamp(current_cell); nit: This value doesn't seem like it's used?
On 2013/11/05 21:58:21, rdsmith wrote: > Still working on wrapping my brain around the CL. > > One questions; What is the intention for the difference in usage between > EntryCell and IndexCell? I understand that IndexCell is a struct (so, among > other things, it can't really have methods without a lot of contorting) but I'm > wondering what gets in the way of making EntryCell the way that the IndexCell > information is used anywhere where we're not actually trying to write an > IndexCell in a bucket? My goal here is bundling all of the functions at the > beginning of index_table.cc that manipulate IndexCells into member functions on > EntryCell and more rarely using IndexCells. (This question is being asked more > to clarify my thinking around IndexCell/EntryCell than it is a serious > suggestion; I figure I'll learn something important from the answer.) There are two basic reasons for that: - I want to keep the EntryCell interface as clean as possible, so that it is easy to see exactly what public users are supposed to do, and a short list of things that are visible only for privileged users (the table itself). On the same lines, keeping a group of functions that directly manipulate the bitmaps to set and extract the stored values, and nothing else, seems like a good thing to me. - The index table itself clearly has to be aware of how things are stored on the table, so it should deal with IndexCells for some stuff. At some point it can make the decision of instantiating an EntryCell and using it to reference an underlying cell, but I tend to do that only when the work to do with a cell is "significant", not just inspecting a cell to see if we care about it or not. Instantiating an EntryCell implies copying of data and breaking the link between something on the table and the EntryCell (EntryCells are not really part of the table, they always live on the stack or heap).
and the comments https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:362: memcpy(destination, &cell_, sizeof(cell_)); On 2013/11/05 21:58:21, rdsmith wrote: > nit: Why not assignment? It would seem simpler. no reason https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:385: if (main_table_) { On 2013/11/05 21:58:21, rdsmith wrote: > What's the difference between this case and |growing == true|? The bits > comparison makes it look like we're confirming that we've just bumped up the > number of bits in the table by one, i.e. that we're growing. And this function > is the only place we set either main_table_ or header_. So it would seem that > this is only different from |growing == true| if params was first passed with a > null main_table_. When can that happen/what does it mean? There are two cases for increasing the size: - We could be doubling the size of the table (which means changing the mask_ etc) - We are not extending the main table, just adding more entries to the extra table. For example, we can have a 64k main table with maybe 8k entries on the extra table (max 72k), and we just decided to add another 8k at the end (grow to 80k). At some point, when we get close to 128k, we double the main table and reset to a small extra table. So, |growing| is true every time after the original initialization... but a main_table will only be passed in when doubling it. (I'll add a comment with this) https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:390: old_extra_table.reset(new IndexBucket[extra_size]); On 2013/11/05 21:58:21, rdsmith wrote: > nit: double whitespace. Done. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:800: int time = GetCellTimestamp(current_cell); On 2013/11/05 21:58:21, rdsmith wrote: > nit: This value doesn't seem like it's used? Done.
A couple more interface level questions. I'll start working on reviewing the implementation now. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:120: struct IndexIterator { Index iterators look to me to be used in a very specific way, which is to get a group of cells all of which have the same timestamp, and when passed back in again to the appropriate function, to get the oldest/youngest group of cells from the current timestamp. This seems a general enough concept to be worth documenting; willing to throw in a comment discussing that here? https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); The existence of this function (and the fact that it's breaching the abstraction boundary around index table timestamps) bothers me a little bit. In an ideal world, I'd think that either this class interface should only take base::Time values, or we should define somewhere (maybe just in a comment) the minutes since base value style of timestamps, and this class interface should only take those values. It looks to me from the target CL that the only place this function is (will eventually be) used for iterator timestamp values, which suggests we could do the abstraction cleanup by making the interface for IndexIterators be base::Time (probably plus an IndexTable, possibly in the constructor). I'm not strongly in favor of that (there could very easily be complexities having to do with threading and access to the base value that I'm not seeing), but I wanted to raise the possibility in case it appealed to you. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:191: // Called each time the index should save the backup information. I presume this is part of the class interface because the backup of the index needs to be synchronized with other backend backup activities. If that's accurate, willing to put in a comment to that effect? (Something like "Exposed so that index backup can be synchronized with the backup of the IndexBackend that owns the index."?)
https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:120: struct IndexIterator { On 2013/11/05 22:31:13, rdsmith wrote: > Index iterators look to me to be used in a very specific way, which is to get a > group of cells all of which have the same timestamp, and when passed back in > again to the appropriate function, to get the oldest/youngest group of cells > from the current timestamp. This seems a general enough concept to be worth > documenting; willing to throw in a comment discussing that here? Done. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/05 22:31:13, rdsmith wrote: > The existence of this function (and the fact that it's breaching the abstraction > boundary around index table timestamps) bothers me a little bit. In an ideal > world, I'd think that either this class interface should only take base::Time > values, or we should define somewhere (maybe just in a comment) the minutes > since base value style of timestamps, and this class interface should only take > those values. > > It looks to me from the target CL that the only place this function is (will > eventually be) used for iterator timestamp values, which suggests we could do > the abstraction cleanup by making the interface for IndexIterators be base::Time > (probably plus an IndexTable, possibly in the constructor). > > I'm not strongly in favor of that (there could very easily be complexities > having to do with threading and access to the base value that I'm not seeing), > but I wanted to raise the possibility in case it appealed to you. The issue is that I intentionally wanted to keep the iterators as dumb structs (even though now they have hidden ctor/dtors for reasons beyond my comprehension at this time). And the time on the iterator cannot be a high resolution variable, it has to be a low resolution timestamp, otherwise time comparisons will fail. Modifying the iterators to provide comparison functions and keep a pointer back to the table to do translations seem like a high burden to hide the fact that the resolution is lower on the table... and having a Time interface may make the caller assume that things have a higher resolution. I don't completely follow the first suggestion. Do you mean a typedef of int to minutes or to timestamp? https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:191: // Called each time the index should save the backup information. On 2013/11/05 22:31:13, rdsmith wrote: > I presume this is part of the class interface because the backup of the index > needs to be synchronized with other backend backup activities. If that's > accurate, willing to put in a comment to that effect? (Something like "Exposed > so that index backup can be synchronized with the backup of the IndexBackend > that owns the index."?) Kind of. I can add more comments but I want to understand what you mean first. I considered having a more direct command from the backend instead (DoFoo), but I wanted to keep the essence of being a general purpose timer tick that can be used by this class if it wants to (and for whatever it wants to)... and that happens to be the same timer that other parts of the code use for other things (which is _sort of_ required for things to work nicely). ... and the timer is provided from outside because it makes testing much easier than just having an independent timer private to this class (plus seems more wasteful). So in terms of the need for synchronization that is just tangential... as in the backend will delay operations for a while that should be enough for the index to update the backup... so a number of choices would work for that. So I don't know exactly how to frame the comment.
(Just responding to questions/comments; will continue with review.) https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/06 02:23:03, rvargas wrote: > The issue is that I intentionally wanted to keep the iterators as dumb structs > (even though now they have hidden ctor/dtors for reasons beyond my comprehension > at this time). Why keep them as dumb structs? I understand the motivation for the disk format structs, but why the iterator? > And the time on the iterator cannot be a high resolution > variable, it has to be a low resolution timestamp, otherwise time comparisons > will fail. Yep, I get that. That's not the same as converting from a high resolution value on construction or initialization, though (though I think we're clear on this from your comments below). > Modifying the iterators to provide comparison functions and keep a pointer back > to the table to do translations seem like a high burden to hide the fact that > the resolution is lower on the table... and having a Time interface may make the > caller assume that things have a higher resolution. Yeah, that makes sense. I will say that since an IndexIterator only makes sense in the context of a specific IndexTable (because the minutes it stores are relevant to that IndexTable), I think a pointer from iterator to table actually wouldn't be an unreasonable thing. To be clear, my goal was not to hide the fact that the resolution is lower on the table--it's to make the interface consistent. If everything on the index table took ints that were minutes since some fixed time, I'd be happy. It's the fact that consumes of the IndexTable interface both use base::Time and have to understand and use the minutes-since-header_->base_time concept (in manipulating the iterators). > I don't completely follow the first suggestion. Do you mean a typedef of int to > minutes or to timestamp? typedef of int to minutes would do it, or just a comment explaining that certain arguments (maybe named index_minutes? :-}) were minutes since header_->base_time. But that's not really ideal either, because header_->base_time is a concept that's pretty far internal to the IndexTable. So this may not be worth it. I just wanted to call out what I saw as an abstraction breaking (that the minutes since base time concept escapes outside the boundary of the IndexTable/IndexIterator interface) and explore whether there was some way of fixing that. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:191: // Called each time the index should save the backup information. On 2013/11/06 02:23:03, rvargas wrote: > On 2013/11/05 22:31:13, rdsmith wrote: > > I presume this is part of the class interface because the backup of the index > > needs to be synchronized with other backend backup activities. If that's > > accurate, willing to put in a comment to that effect? (Something like > "Exposed > > so that index backup can be synchronized with the backup of the IndexBackend > > that owns the index."?) > > Kind of. I can add more comments but I want to understand what you mean first. I > considered having a more direct command from the backend instead (DoFoo), but I > wanted to keep the essence of being a general purpose timer tick that can be > used by this class if it wants to (and for whatever it wants to)... and that > happens to be the same timer that other parts of the code use for other things > (which is _sort of_ required for things to work nicely). > > ... and the timer is provided from outside because it makes testing much easier > than just having an independent timer private to this class (plus seems more > wasteful). > > So in terms of the need for synchronization that is just tangential... as in the > backend will delay operations for a while that should be enough for the index to > update the backup... so a number of choices would work for that. > > So I don't know exactly how to frame the comment. "Interface exposed for testing, and in case backup of these data structures should be synchronized with other backup operations."? I'm mostly reacting to the fact that when I read through the code I was initially confused as to why this wasn't internal with its own timer--your reasons make sense to me, but I think there's some use in making them explicit. (I'll admit that hearing that the synchronization is "sort of" required for things to work nicely makes me nervous; that's something I'd very much like to have explained in its own comment in the code. But I suspect that comment doesn't belong here or in this CL.)
https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/06 18:16:15, rdsmith wrote: > On 2013/11/06 02:23:03, rvargas wrote: > > The issue is that I intentionally wanted to keep the iterators as dumb structs > > (even though now they have hidden ctor/dtors for reasons beyond my > comprehension > > at this time). > > Why keep them as dumb structs? I understand the motivation for the disk format > structs, but why the iterator? It's just that I see this as a storage for iteration-related data, not really as an object that is trying to encapsulate something. It's main purpose is to avoid passing a bunch of separate arguments to all the functions that are involved with enumerations. > > > And the time on the iterator cannot be a high resolution > > variable, it has to be a low resolution timestamp, otherwise time comparisons > > will fail. > > Yep, I get that. That's not the same as converting from a high resolution value > on construction or initialization, though (though I think we're clear on this > from your comments below). > > > Modifying the iterators to provide comparison functions and keep a pointer > back > > to the table to do translations seem like a high burden to hide the fact that > > the resolution is lower on the table... and having a Time interface may make > the > > caller assume that things have a higher resolution. > > Yeah, that makes sense. I will say that since an IndexIterator only makes sense > in the context of a specific IndexTable (because the minutes it stores are > relevant to that IndexTable), I think a pointer from iterator to table actually > wouldn't be an unreasonable thing. it's certainly not unreasonable :) > > To be clear, my goal was not to hide the fact that the resolution is lower on > the table--it's to make the interface consistent. If everything on the index > table took ints that were minutes since some fixed time, I'd be happy. It's the > fact that consumes of the IndexTable interface both use base::Time and have to > understand and use the minutes-since-header_->base_time concept (in manipulating > the iterators). > > > I don't completely follow the first suggestion. Do you mean a typedef of int > to > > minutes or to timestamp? > > typedef of int to minutes would do it, or just a comment explaining that certain > arguments (maybe named index_minutes? :-}) were minutes since > header_->base_time. But that's not really ideal either, because > header_->base_time is a concept that's pretty far internal to the IndexTable. > So this may not be worth it. I just wanted to call out what I saw as an > abstraction breaking (that the minutes since base time concept escapes outside > the boundary of the IndexTable/IndexIterator interface) and explore whether > there was some way of fixing that. I tried the typedef (Timestamp) but I didn't liked it!. I agree that I don't want code outside of this class worrying about the actual resolution of the timestamp... if it is 30 seconds, or 1 minute of whatever... and that's why I don't want to make it part of the contract that it is actually 1 minute (although there's a unit test for that)... but I still like the simple idea of having the caller be aware of the existence of a low resolution time without trying to hide that detail behind a full iterator interface. So I added a longer comment. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:191: // Called each time the index should save the backup information. On 2013/11/06 18:16:15, rdsmith wrote: > On 2013/11/06 02:23:03, rvargas wrote: > > On 2013/11/05 22:31:13, rdsmith wrote: > > > I presume this is part of the class interface because the backup of the > index > > > needs to be synchronized with other backend backup activities. If that's > > > accurate, willing to put in a comment to that effect? (Something like > > "Exposed > > > so that index backup can be synchronized with the backup of the IndexBackend > > > that owns the index."?) > > > > Kind of. I can add more comments but I want to understand what you mean first. > I > > considered having a more direct command from the backend instead (DoFoo), but > I > > wanted to keep the essence of being a general purpose timer tick that can be > > used by this class if it wants to (and for whatever it wants to)... and that > > happens to be the same timer that other parts of the code use for other things > > (which is _sort of_ required for things to work nicely). > > > > ... and the timer is provided from outside because it makes testing much > easier > > than just having an independent timer private to this class (plus seems more > > wasteful). > > > > So in terms of the need for synchronization that is just tangential... as in > the > > backend will delay operations for a while that should be enough for the index > to > > update the backup... so a number of choices would work for that. > > > > So I don't know exactly how to frame the comment. > > "Interface exposed for testing, and in case backup of these data structures > should be synchronized with other backup operations."? I'm mostly reacting to > the fact that when I read through the code I was initially confused as to why > this wasn't internal with its own timer--your reasons make sense to me, but I > think there's some use in making them explicit. > > (I'll admit that hearing that the synchronization is "sort of" required for > things to work nicely makes me nervous; that's something I'd very much like to > have explained in its own comment in the code. But I suspect that comment > doesn't belong here or in this CL.) How about now?
Sorry; a lot of this review is still my climbing the cache learning curve, but I *am* making progress :-}. There are a couple of abstractions I want to dig into further (I mention the overlap in code between GetOldest() and GetNextCells() below, and I'd like to think about whether there's a way to better encapsulate the small vs. large table distinction and deal with the abstraction breakage mentioned in the initial comment, but I haven't quite understood the breakage yet). But here's another round of comments, hopefully moving the process forward. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, It took me a bit of time to figure out why the iteration was being done this way, rather than just hammering through all the buckets in the index. So I think there could usefully be a comment somewhere in this file outlining the algorithm for iterating through the index, given that that algorithm is used in several places in the file. (I thought about suggesting we try to abstract that algorithm in some fashion, but I couldn't come up with anything that didn't involve function pointers/callbacks, which didn't strike me as worth it). https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash, A quick comment describing what this does? It's non-trivial (i.e. you, or at least I, needed to read the code closely to understand it), it's not really self-describing, and it's used as a primitive in another function. Maybe "If the passed cell is the same timestamp as |*list_time|, add it to the list. If it's an earlier timestamp, replace the list with it and update |*list_time|. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash, Why is |bucket_hash| an argument to this function? It isn't used. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:423: extra_table_ = params->extra_table; Could you put a comment in as to why this is safe to do in the non-doubling case, but we need to save and copy the cells in the non-doubling case? (I presume it's because in the non-doubling case we're memory mapping the same section of file + some extra, so we don't lose anything, but if we're doubling we want to copy the stuff that used to be extra into the main section, but I think that's worth calling out.) https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:434: int num_words = (header_->table_len + 31) / 32; table_len is in bits? I wouldn't have naturally assumed that; might be worth a comment somewhere. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time < iterator->timestamp) || This next section of code looks to have a lot of overlap with UpdateListFromCells; might it be worthwhile to add a directional argument to that function and re-use it here? Alternatively, would it make sense to change UpdateListWithCell operate on iterators, and use iterators in the GetOldest/GetOldestFromBucket logic? That might simplify the calling signature for GetOldestFromBucket because the iterators would combine the *_use and *_use_time arguments. (I'm feeling like there's a fair amount of overlap between the functionality of GetOldest/GetNextCells and looking for ways to combine code--I'm also wondering if there might be some way to combine GetNewestFromBucket() and GetOldestFromBucket(). If you like one of the above suggestions I may have other ideas after I see that code.) https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells from the current timestamp. This doesn't seem quite right to me / doesn't match the code as I read it. As I read the code, what's happening is that when you call GetNextCells, you note the current timestamp, trash the current set of cells, and get all cells that are in the next timestamp that has any cells in it in the direction you're going. That right? If so, I find "from the current timestamp" confusing; it's really from the next oldest/youngest timestamp. It also looks to me as if currently, GetNextCells() is the only function that takes/uses iterators. If that's true and is likely to stay true for a while, referring directly to that function would make sense to me. I also have a slight tug towards making this a nested class for IndexTable, since it's only ever used with IndexTable. I remember that nested classes are discouraged, but my memory as to why is that often their definitions aren't needed in the header files, and this definition is in the header file anyway. But it might be worth avoiding because it's not a common pattern, so this is just a suggestion. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:130: int timestamp; // The current low resolution timestamp for |cells|. Worth having a reference to IndexTable::CalculateTimestamp? https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:178: EntrySet LookupEntry(uint32 hash); nit, suggestion: This is a bit dependent on the definition of "Entry", but to me, it's looking up multiple entries (all entries that share the same hash). So based on my current understanding of the abstractions, I think it would be better as LookupEntries(). WDYT? https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:189: // of cells will share the same timestamp. The comment looks good to me. Is it worthwhile having a reference to IndexIterator here, since that's the only place that the consumer would use the low resolution timestamp? https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator); I'm bothered by the parallel names here for two functions that do fairly different things. Maybe rename GetNewestFromBucket to UpdateIteratorFromBucket?
Thanks. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, On 2013/11/07 20:25:15, rdsmith wrote: > It took me a bit of time to figure out why the iteration was being done this > way, rather than just hammering through all the buckets in the index. So I What do you mean by hammering through all the buckets? I think we are hammering through all the buckets :). It's just that the structure of the table implies that to locate something on the extra table one has to start on a given bucket on the main table. So we could ignore the extra buckets when looking at the main table, but we really cannot ignore the main table when looking at the extra table. Is that what you mean? > think there could usefully be a comment somewhere in this file outlining the > algorithm for iterating through the index, given that that algorithm is used in > several places in the file. (I thought about suggesting we try to abstract that > algorithm in some fashion, but I couldn't come up with anything that didn't > involve function pointers/callbacks, which didn't strike me as worth it). I think I went through that same phase at some point, but the conclusion was the same: attempting to generalize the code was not worth it / would just complicate things. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash, On 2013/11/07 20:25:15, rdsmith wrote: > A quick comment describing what this does? It's non-trivial (i.e. you, or at > least I, needed to read the code closely to understand it), it's not really > self-describing, and it's used as a primitive in another function. Maybe "If > the passed cell is the same timestamp as |*list_time|, add it to the list. If > it's an earlier timestamp, replace the list with it and update |*list_time|. will do https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash, On 2013/11/07 20:25:15, rdsmith wrote: > Why is |bucket_hash| an argument to this function? It isn't used. Sorry about that... it's just the result of refactoring the code. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:423: extra_table_ = params->extra_table; On 2013/11/07 20:25:15, rdsmith wrote: > Could you put a comment in as to why this is safe to do in the non-doubling > case, but we need to save and copy the cells in the non-doubling case? (I > presume it's because in the non-doubling case we're memory mapping the same > section of file + some extra, so we don't lose anything, but if we're doubling > we want to copy the stuff that used to be extra into the main section, but I > think that's worth calling out.) Will do, and you're correct. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:434: int num_words = (header_->table_len + 31) / 32; On 2013/11/07 20:25:15, rdsmith wrote: > table_len is in bits? I wouldn't have naturally assumed that; might be worth a > comment somewhere. table_len is in cells (aka, buckets * 4). But we need to store a bitmap, which has one bit per cell, so this is how many 32-bit words that bitmap needs. I'll add a comment. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time < iterator->timestamp) || On 2013/11/07 20:25:15, rdsmith wrote: > This next section of code looks to have a lot of overlap with > UpdateListFromCells; might it be worthwhile to add a directional argument to > that function and re-use it here? Alternatively, would it make sense to change > UpdateListWithCell operate on iterators, and use iterators in the > GetOldest/GetOldestFromBucket logic? That might simplify the calling signature > for GetOldestFromBucket because the iterators would combine the *_use and > *_use_time arguments. > > (I'm feeling like there's a fair amount of overlap between the functionality of > GetOldest/GetNextCells and looking for ways to combine code--I'm also wondering > if there might be some way to combine GetNewestFromBucket() and > GetOldestFromBucket(). If you like one of the above suggestions I may have > other ideas after I see that code.) I'll look at it again. This was the main part where I mentioned that I considered ways to extract code into a single implementation method but I couldn't come up with anything that was simpler (worth it)... as in at least this way is easier to understand. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells from the current timestamp. On 2013/11/07 20:25:15, rdsmith wrote: > This doesn't seem quite right to me / doesn't match the code as I read it. As I > read the code, what's happening is that when you call GetNextCells, you note the > current timestamp, trash the current set of cells, and get all cells that are in > the next timestamp that has any cells in it in the direction you're going. That > right? If so, I find "from the current timestamp" confusing; it's really from > the next oldest/youngest timestamp. I'll clarify the comment. It doesn't mean that the caller sets the timestamp and the callee returns cells that match it... it means that when the iterator is returned, the cells in the iterator share the timestamp (as it is set on return). So GetNext(from 0) will returns cells with 1, or 2 or n as timestamp, not 0. I'll also link the comment with the API. As for a nested class, I don't mind it either way, but I know the guide applies even to Delegates, which are tightly tied to the class that declares it and are always going to be part of the same header... and we have lots of nested delegates, but AFAIK we want to go in the other direction. But given that I don't mind either way I'll leave it up to you. > > It also looks to me as if currently, GetNextCells() is the only function that > takes/uses iterators. If that's true and is likely to stay true for a while, > referring directly to that function would make sense to me. > > I also have a slight tug towards making this a nested class for IndexTable, > since it's only ever used with IndexTable. I remember that nested classes are > discouraged, but my memory as to why is that often their definitions aren't > needed in the header files, and this definition is in the header file anyway. > But it might be worth avoiding because it's not a common pattern, so this is > just a suggestion. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:178: EntrySet LookupEntry(uint32 hash); On 2013/11/07 20:25:15, rdsmith wrote: > nit, suggestion: This is a bit dependent on the definition of "Entry", but to > me, it's looking up multiple entries (all entries that share the same hash). So > based on my current understanding of the abstractions, I think it would be > better as LookupEntries(). WDYT? sure
https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time < iterator->timestamp) || On 2013/11/08 04:22:30, rvargas wrote: > On 2013/11/07 20:25:15, rdsmith wrote: > > This next section of code looks to have a lot of overlap with > > UpdateListFromCells; might it be worthwhile to add a directional argument to > > that function and re-use it here? Alternatively, would it make sense to > change > > UpdateListWithCell operate on iterators, and use iterators in the > > GetOldest/GetOldestFromBucket logic? That might simplify the calling > signature > > for GetOldestFromBucket because the iterators would combine the *_use and > > *_use_time arguments. > > > > (I'm feeling like there's a fair amount of overlap between the functionality > of > > GetOldest/GetNextCells and looking for ways to combine code--I'm also > wondering > > if there might be some way to combine GetNewestFromBucket() and > > GetOldestFromBucket(). If you like one of the above suggestions I may have > > other ideas after I see that code.) > > I'll look at it again. This was the main part where I mentioned that I > considered ways to extract code into a single implementation method but I > couldn't come up with anything that was simpler (worth it)... as in at least > this way is easier to understand. Done.
Still reviewing, but this is a bundle of comments all about iteration through the table, and so I figure I can share these and we can work in parallel. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, On 2013/11/08 04:22:30, rvargas wrote: > On 2013/11/07 20:25:15, rdsmith wrote: > > It took me a bit of time to figure out why the iteration was being done this > > way, rather than just hammering through all the buckets in the index. So I > > What do you mean by hammering through all the buckets? Well, what I meant was |for (int i = 0; i < header()->max_bucket(); i++))| and removing the call to GetNextBucket(). But I think that was based on an incorrect understanding of the layout (i.e. that everything was contiguous in main_table_; I hadn't put together the passing of extra_table_ with how GetNextBucket() works). Ooops. Ok, let me reflect back what I now think is going on: * main_table_ points to a table of size mask_ + 1. That grows by doubling only. * extra_table_ points to growth in the table when we don't need enough extra entries to double the main table. This is "easier" growth, since it just means extending the extra_table file, and everything else (other than the size of that file/table) stays unaltered. There's a bit of a change when we switch pointers from the old mapping of the extra table file to the new mapping, but because they map the same file, it's fine for them to both be active for a bit. * All bucket chains start in the main table, but all buckets except the first in any chain will be in the extra table. > I think we are hammering > through all the buckets :). It's just that the structure of the table implies > that to locate something on the extra table one has to start on a given bucket > on the main table. So we could ignore the extra buckets when looking at the main > table, but we really cannot ignore the main table when looking at the extra > table. Is that what you mean? Based on my new understanding I'll modify my thought (not even a suggestion; just the alternative algorithm I went through in getting to my current understanding). Could we write the above loop as?: for (int i = 0; i < static_cast<int32>(mask_ + 1); i++) GetNewestFromBucket(&main_table_[i], i, current_time, iterator); for (int i = static_cast<int32>(mask_ + 1); i < header()->max_bucket; i++) { IndexBucket* bucket = &extra_table_[i - static_cast<int32>(mask_ + 1)]; GetNewestFromBucket(bucket, bucket->hash, current_time, iterator); } ?? Not saying we should, just trying to make sure I understand the data structure format. > > > think there could usefully be a comment somewhere in this file outlining the > > algorithm for iterating through the index, given that that algorithm is used > in > > several places in the file. (I thought about suggesting we try to abstract > that > > algorithm in some fashion, but I couldn't come up with anything that didn't > > involve function pointers/callbacks, which didn't strike me as worth it). > > I think I went through that same phase at some point, but the conclusion was the > same: attempting to generalize the code was not worth it / would just complicate > things. Makes sense. However, if I'm struggling as much as I am with getting solidly on top of the iteration method and it's rationale, I'd like to have a comment somewhere that sketches those two things out. Is there such a comment? If not, would you be willing to put one in? https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/07 02:02:07, rvargas wrote: > On 2013/11/06 18:16:15, rdsmith wrote: > > On 2013/11/06 02:23:03, rvargas wrote: > > > The issue is that I intentionally wanted to keep the iterators as dumb > structs > > > (even though now they have hidden ctor/dtors for reasons beyond my > > comprehension > > > at this time). > > > > Why keep them as dumb structs? I understand the motivation for the disk > format > > structs, but why the iterator? > > It's just that I see this as a storage for iteration-related data, not really as > an object that is trying to encapsulate something. It's main purpose is to avoid > passing a bunch of separate arguments to all the functions that are involved > with enumerations. Hmmm. Makes sense. I think part of why I pull in the other direction is just because I find it useful to conceptually bundle groups of related functions together. Specifically InitIterator and UpdateIterator are very much "work with the iterator data structure", so I feel a pull to declare them as methods of iterator. (InitIterator() could be the Iterator constructor with just a bit of tweaking). I could imagine those being fine as methods on a struct, but that's a slippery slope to full class status. So it's your call. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator); On 2013/11/07 20:25:15, rdsmith wrote: > I'm bothered by the parallel names here for two functions that do fairly > different things. Maybe rename GetNewestFromBucket to UpdateIteratorFromBucket? I still feel this way; what are your thoughts? https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:904: } I like this refactor. I'll note that it could be taken one step further, though I'm not certain without looking at the code whether that would be a good thing or not, so I'll just offer it as an idea for you to consider. The key function would be one that: * Took as arguments: ** no_use, low_use, and high_use iterators ** Pointers to counters for those three cases + evicted entries. ** A limit time. * Cycled through all buckets and all cells. * Updated iterators according to the use type of the cell and the limit time. That would be called by both GetOldest and GetNextCells; GetNextCells() would pass in the same iterator for all three cases, and null pointers for the counting, and GetOldest() would reverse that. To me the major con would be that the interface for the function would be pretty big (eight arguments). Pros would be collapsing the above two functions and simplifying GetOldest() and GetNextCells(), and making GetOldestFromBucket() not modify IndexTable (it looks like a getter; the mutating is a little surprising). I could imagine expanding iterator to keep a count of cells cycled through it, which would get rid of three arguments, and possibly turning the iterator argument into an array. But as I say, that all may or may not be worth it, so I'll leave that up to you. https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells from the current timestamp. When this nit, suggestion: "the oldest / youngest" -> "the next older / younger"? https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:127: // timestamp for them. I'm still finding this comment a bit confusing, both in the references to functions above ("appropriate function" vs. "GetNextCells") and in the description of what IndexIterator does and how its used. The only interfaces that this can be used with by the consumer are GetNextCells() and GetOldest(), so if it's consumer targeted we should either refer to both those functions or make an abstract reference to what they each do. So I think I'd drop back a couple of yards, and describe what the iterator is/can be used for in the context of those functions. How about something like: "An index iterator holds a group of cells, all of which have the same timestamp. It can be used to iterate through the cells in an IndexTable in a given direction (forward or backward in time) via calls to GetNextCells(), returning on each call all the cells with the same timestamp." (See below as to why I'm leaving out GetOldest().) https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:133: int timestamp; // The current low resolution timestamp for |cells|. Did you have any reaction to my suggestion of cross references to/from CalculateTimestamp below? https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:209: IndexIterator* high_use); The implementation of this function initializes all the iterators and set the forward variable, which I believe means that the caller really doesn't care about the |forward| and |timestamp| portions of the IndexIterator. Looking at https://codereview.chromium.org/17507006, it looks like the interface to this function used to be CellLists. Why isn't it in this CL? That would seem to capture the information being passed back to the caller more precisely.
Thanks https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, On 2013/11/11 20:54:54, rdsmith wrote: > On 2013/11/08 04:22:30, rvargas wrote: > > On 2013/11/07 20:25:15, rdsmith wrote: > > > It took me a bit of time to figure out why the iteration was being done this > > > way, rather than just hammering through all the buckets in the index. So I > > > > What do you mean by hammering through all the buckets? > > Well, what I meant was |for (int i = 0; i < header()->max_bucket(); i++))| and > removing the call to GetNextBucket(). But I think that was based on an > incorrect understanding of the layout (i.e. that everything was contiguous in > main_table_; I hadn't put together the passing of extra_table_ with how > GetNextBucket() works). Ooops. Ok, let me reflect back what I now think is > going on: > * main_table_ points to a table of size mask_ + 1. That grows by doubling only. > * extra_table_ points to growth in the table when we don't need enough extra > entries to double the main table. This is "easier" growth, since it just means > extending the extra_table file, and everything else (other than the size of that > file/table) stays unaltered. There's a bit of a change when we switch pointers > from the old mapping of the extra table file to the new mapping, but because > they map the same file, it's fine for them to both be active for a bit. > * All bucket chains start in the main table, but all buckets except the first in > any chain will be in the extra table. Correct > > > I think we are hammering > > through all the buckets :). It's just that the structure of the table implies > > that to locate something on the extra table one has to start on a given bucket > > on the main table. So we could ignore the extra buckets when looking at the > main > > table, but we really cannot ignore the main table when looking at the extra > > table. Is that what you mean? > > Based on my new understanding I'll modify my thought (not even a suggestion; > just the alternative algorithm I went through in getting to my current > understanding). Could we write the above loop as?: > > for (int i = 0; i < static_cast<int32>(mask_ + 1); i++) > GetNewestFromBucket(&main_table_[i], i, current_time, iterator); > > for (int i = static_cast<int32>(mask_ + 1); i < header()->max_bucket; i++) { > IndexBucket* bucket = &extra_table_[i - static_cast<int32>(mask_ + 1)]; > GetNewestFromBucket(bucket, bucket->hash, current_time, iterator); > } > > ?? Not saying we should, just trying to make sure I understand the data > structure format. > Correct > > > > > think there could usefully be a comment somewhere in this file outlining the > > > algorithm for iterating through the index, given that that algorithm is used > > in > > > several places in the file. (I thought about suggesting we try to abstract > > that > > > algorithm in some fashion, but I couldn't come up with anything that didn't > > > involve function pointers/callbacks, which didn't strike me as worth it). > > > > I think I went through that same phase at some point, but the conclusion was > the > > same: attempting to generalize the code was not worth it / would just > complicate > > things. > > Makes sense. However, if I'm struggling as much as I am with getting solidly on > top of the iteration method and it's rationale, I'd like to have a comment > somewhere that sketches those two things out. Is there such a comment? If not, > would you be willing to put one in? Will do. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_... net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/11 20:54:54, rdsmith wrote: > On 2013/11/07 02:02:07, rvargas wrote: > > On 2013/11/06 18:16:15, rdsmith wrote: > > > On 2013/11/06 02:23:03, rvargas wrote: > > > > The issue is that I intentionally wanted to keep the iterators as dumb > > structs > > > > (even though now they have hidden ctor/dtors for reasons beyond my > > > comprehension > > > > at this time). > > > > > > Why keep them as dumb structs? I understand the motivation for the disk > > format > > > structs, but why the iterator? > > > > It's just that I see this as a storage for iteration-related data, not really > as > > an object that is trying to encapsulate something. It's main purpose is to > avoid > > passing a bunch of separate arguments to all the functions that are involved > > with enumerations. > > Hmmm. Makes sense. I think part of why I pull in the other direction is just > because I find it useful to conceptually bundle groups of related functions > together. Specifically InitIterator and UpdateIterator are very much "work with > the iterator data structure", so I feel a pull to declare them as methods of > iterator. (InitIterator() could be the Iterator constructor with just a bit of > tweaking). I could imagine those being fine as methods on a struct, but that's > a slippery slope to full class status. So it's your call. I thought you were going to say something like that :) The thing is that both methods would be private to the iterator, because what they do is an implementation detail of the code that returns iterators... is not something that the reader of the header (user of the class) should worry about... or ever directly use. That leads directly to making the struct a class, with a friend, and for the caller to benefit it would now need actual public methods thrown into the mix. I think I still prefer the trivial, fully open view for the caller. https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator); On 2013/11/11 20:54:54, rdsmith wrote: > On 2013/11/07 20:25:15, rdsmith wrote: > > I'm bothered by the parallel names here for two functions that do fairly > > different things. Maybe rename GetNewestFromBucket to > UpdateIteratorFromBucket? > > I still feel this way; what are your thoughts? Sorry, I missed this after the last refactor. https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:904: } On 2013/11/11 20:54:54, rdsmith wrote: > I like this refactor. I'll note that it could be taken one step further, though > I'm not certain without looking at the code whether that would be a good thing > or not, so I'll just offer it as an idea for you to consider. The key function > would be one that: > > * Took as arguments: > ** no_use, low_use, and high_use iterators > ** Pointers to counters for those three cases + evicted entries. > ** A limit time. > * Cycled through all buckets and all cells. > * Updated iterators according to the use type of the cell and the limit time. > > That would be called by both GetOldest and GetNextCells; GetNextCells() would > pass in the same iterator for all three cases, and null pointers for the > counting, and GetOldest() would reverse that. To me the major con would be that > the interface for the function would be pretty big (eight arguments). Pros > would be collapsing the above two functions and simplifying GetOldest() and > GetNextCells(), and making GetOldestFromBucket() not modify IndexTable (it looks > like a getter; the mutating is a little surprising). > > I could imagine expanding iterator to keep a count of cells cycled through it, > which would get rid of three arguments, and possibly turning the iterator > argument into an array. But as I say, that all may or may not be worth it, so > I'll leave that up to you. My first reaction was not to grow a do-all method. However, one method was basically a subset of the other one, so merged. https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:127: // timestamp for them. On 2013/11/11 20:54:54, rdsmith wrote: > I'm still finding this comment a bit confusing, both in the references to > functions above ("appropriate function" vs. "GetNextCells") and in the > description of what IndexIterator does and how its used. The only interfaces > that this can be used with by the consumer are GetNextCells() and GetOldest(), > so if it's consumer targeted we should either refer to both those functions or > make an abstract reference to what they each do. So I think I'd drop back a > couple of yards, and describe what the iterator is/can be used for in the > context of those functions. How about something like: > > "An index iterator holds a group of cells, all of which have the same timestamp. > It can be used to iterate through the cells in an IndexTable in a given > direction (forward or backward in time) via calls to GetNextCells(), returning > on each call all the cells with the same timestamp." > > (See below as to why I'm leaving out GetOldest().) Another attempt :) https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:133: int timestamp; // The current low resolution timestamp for |cells|. On 2013/11/11 20:54:54, rdsmith wrote: > Did you have any reaction to my suggestion of cross references to/from > CalculateTimestamp below? I don't particularly want to cross reference individual members... and the structure itself is already cross referenced with the methods that return it; I added some reference to the timestamp on the description of GetNextCells. I'm adding an explicit reference to the iterator from CalculateTimestamp. Note that I'm not really trying to hide the existence of a table/cell timestamp, so both the table and the cell have methods that return a timestamp. https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:209: IndexIterator* high_use); On 2013/11/11 20:54:54, rdsmith wrote: > The implementation of this function initializes all the iterators and set the > forward variable, which I believe means that the caller really doesn't care > about the |forward| and |timestamp| portions of the IndexIterator. Looking at > https://codereview.chromium.org/17507006, it looks like the interface to this > function used to be CellLists. Why isn't it in this CL? That would seem to > capture the information being passed back to the caller more precisely. Up to patch set 4, this method returned just the cells. However, unifying what to do with a cell when doing enumerations meant that the signature of this method had to change because otherwise the whole returned lists would have to be copied just before returning (an iterator cannot keep a reference to a list, it is the owner of the list). OK, I could swap the lists at that time... but the caller is keeping the timestamps anyway because they are needed to select what to delete, so the only thing that is really not used is the direction... or to view it from the other side, this method returns three iterators, it just ignores their initial values. I believe the reason this method was using lists directly is that I probably wrote it before creating the iterators.
The iteration stuff all looks good to me (with the nits below around the comment on GetNextCells()). I'm sorry this is taking so long, but I really do think we're making progress. That progress may be primarily measured in sections of the cache that I'm successfully wrapping my brain around, but that's a progress measure :-}. I think once I successfully wrap my brain around the address formats and the small vs. large table distinction I'll be in a position to do one large push and finish up the review. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { I spent some time wrapping my brain around this function trying to understand the format difference between the small table and the large table. From disk_format_v3.h the only difference is the number of bits allocated to the hash and the address (small->large hash/addr: 16/16 -> 10/22). But looking at this function, it looks like we're also shifting from situation where we aren't using separate files (i.e. everything is always in either the entries file or the evicted entries file) to one in which the file is encoded in the address from the IndexCell. This right? If so, there looks to be a subtle contradiction between two different format descriptions. addr.h indicates taht it's reserving 8 bits in a CacheAddr for the file selector. But the specification in disk_format_v3.h implies that we're only using six of those eight bits (if the bottom 16 bits are for the block, that only leaves six bits for the file). Is that right? So a general concern (sorta implied by how long it took me to puzzle out the above) is around the documentation of the different forms of address and their relationship to each other, and the documentation of the difference in file usage for small/large table size. Is this done somewhere and I missed it? If not, could it be done somewhere? (After writing this, I went back and read the comment at the top of the file, which I hadn't understood on my first pass through, but now that I've re-read it having better grokked the file formats and small/large table distinction I understand better. But I still feels like it should be easier to really grok the file formats from reading through addr.h, disk_format_base.h, and disk_format_v3.h, though that may be just my failure in stumbling through that documentation in the wrong order.) https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { Another random table format question: How do you get an Addr() with number of contiguous blocks != 1? If you have such a thing, how do you store it in the bits allocated for the address in an IndexCell? https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:262: return Addr::FromEntryAddress(address_value); For both these calls, it feels like "From*Address" is a bit of a misnomer, both because "address" means several different things in this code, and because the address_value means different things in the small and large table formats. Maybe From*FileIndexAddress? https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr address); I tried to grok the different usages of CreateEntryCell() and FindEntryCell() by looking at the reference v3 CL, but I couldn't get it. They look different enough that my initial thought of "Would a single FindOrCreate interface work?" almost certainly wasn't a good one. But I'm not certain I understand why; could you give me a sense as to how these will usually be used? Specifically, it seems like if you're doing a CreateEntryCell(), you've already done the hard work if you've figured out the address, so it sounds like CreateEntryCell should only be called from code that actually does allocation within the cache (which is the backend??). FindEntryCell you can do when you have the address, but where are you likely to have gotten the address from, if not the cache? https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. Where is the boundary between low usage and high usage defined? https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:211: // that this method does not return the cells matching the provided timestamp, nit(s): I'd leave out the "keep in mind" since I don't think that information is actually given anywhere else, change "the provided timestamp" to "the iterator's timestamp", and change "(or before)" to "(or before, depending on the value of the iterator's forward member)". With that, I think this documentation + the iterator documentation is good; are those changes ok with you?
Thanks https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/13 21:14:12, rdsmith wrote: > I spent some time wrapping my brain around this function trying to understand > the format difference between the small table and the large table. From > disk_format_v3.h the only difference is the number of bits allocated to the hash > and the address (small->large hash/addr: 16/16 -> 10/22). But looking at this > function, it looks like we're also shifting from situation where we aren't > using separate files (i.e. everything is always in either the entries file or > the evicted entries file) to one in which the file is encoded in the address > from the IndexCell. This right? The issue is that the small table format only supports up to 64k entries. 64k entries is also roughly the limit of entries that can be stored by a single block file, and given that other parts of the design assign a specific file type for entries, the net result is that the block file type and number are fixed for the small table case. It is a case of implied bits that don't have to be stored. Large tables of course support more than 64k entries, and that means that the file number is also part of what changes in the address, and has to be stored. Now, ideally, an Address should be transparent for this code... and from the point of view of the index format it shouldn't really matter what specific bits of that address mean. On the other hand, the Address itself shouldn't care if it is being stored by an index or not. In other words, either part of the definition of Address leaks here, or details of the index leak into Address. So what this code says is: - for small tables, _I_ (this method) know that some bits here are fixed, and the Address API gives me a clear way to set them as needed. - for large tables, _I_ know how many bits I have on the table, and I know that they are not enough bits to overflow the "control" part of an Address, but I'll let the Address create the right object though a dedicated method because the normal API doesn't have what I need. We could add more methods to Address to make it aware of small tables (or add more arguments to the new methods), but between the two files I think it is better to leak from Address to Index than the other way. I could use Addr::FromEvictedAddress even for small tables after massaging the value, but I don't think that's what you want. > > If so, there looks to be a subtle contradiction between two different format > descriptions. addr.h indicates that it's reserving 8 bits in a CacheAddr for > the file selector. But the specification in disk_format_v3.h implies that we're > only using six of those eight bits (if the bottom 16 bits are for the block, > that only leaves six bits for the file). Is that right? Sort of. The low order 24 bits of an address are "pure" addressing bits in that they represent the addressing capacity of the format. The upper bits are some sort of control, but an address cannot enumerate more than 2^24 entities. BlockFiles is the one who really cares about the distinction between the low order 16 bits and the file number. So, overall, this code is saying that it can store up to 2^22 entries (active + deleted) so the high order bits of the file number are always going to be 0. > > So a general concern (sorta implied by how long it took me to puzzle out the > above) is around the documentation of the different forms of address and their > relationship to each other, and the documentation of the difference in file > usage for small/large table size. Is this done somewhere and I missed it? If > not, could it be done somewhere? Sorry about that. There is some text in disk_format_v3.h about what an index address is and how that is translated to Addr, but I expanded that comment with an example, and also extended the description on addr.h > > (After writing this, I went back and read the comment at the top of the file, > which I hadn't understood on my first pass through, but now that I've re-read it > having better grokked the file formats and small/large table distinction I > understand better. But I still feels like it should be easier to really grok > the file formats from reading through addr.h, disk_format_base.h, and > disk_format_v3.h, though that may be just my failure in stumbling through that > documentation in the wrong order.) https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/13 21:14:12, rdsmith wrote: > Another random table format question: How do you get an Addr() with number of > contiguous blocks != 1? If you have such a thing, how do you store it in the > bits allocated for the address in an IndexCell? That case corresponds for example to storing 800 bytes of data somewhere. In that case, block size is 256 and there are 4 blocks used for the record. The index table only stores records of active or deleted entries, and they have a fixed size, so they will always use just one block. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:262: return Addr::FromEntryAddress(address_value); On 2013/11/13 21:14:12, rdsmith wrote: > For both these calls, it feels like "From*Address" is a bit of a misnomer, both > because "address" means several different things in this code, and because the > address_value means different things in the small and large table formats. > Maybe From*FileIndexAddress? The name follows what is available on addr.h :( I can change it if you feel strongly about it, but the only things Addr knows about is BLOCK_ENTRIES and BLOCK_EVICTED, it knows nothing about the index. (of course, that is not completely true, as it knows what it means to grab value and put it into a meaningless Addr) FromEntryFileAddress could work, but I don't like that it's not clear what "file" is doing there... as in that word sticks with Entry (correct), but also with Address (wrong) Another option would be to not use "address" for the index, but I could not come up with another term https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr address); On 2013/11/13 21:14:12, rdsmith wrote: > I tried to grok the different usages of CreateEntryCell() and FindEntryCell() by > looking at the reference v3 CL, but I couldn't get it. They look different > enough that my initial thought of "Would a single FindOrCreate interface work?" > almost certainly wasn't a good one. But I'm not certain I understand why; could > you give me a sense as to how these will usually be used? Specifically, it > seems like if you're doing a CreateEntryCell(), you've already done the hard > work if you've figured out the address, so it sounds like CreateEntryCell should > only be called from code that actually does allocation within the cache (which > is the backend??). FindEntryCell you can do when you have the address, but > where are you likely to have gotten the address from, if not the cache? There should be only two callers for CreateEntryCell and that is the backend creating a new entry (after making sure that the entry doesn't exist), and creating an evicted entry record, when evicting a normal entry. On the other hand, FindEntryCell is used to make sure that something that was valid some time ago is still valid right now (this is explained with the declaration of EntryCell). Basically, after calling LookupEntries you end up with a bunch of records that exist right now (hopefully just one), but the caller now has to go open that file and read it, and that takes time. By the time that operation completes, the caller should not use the same EntryCell right away, or at least not write to it, because it may have moved. So it has to grab the hash and address and do a quick look up to get the current location of the cell, and then it can write to it. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. On 2013/11/13 21:14:12, rdsmith wrote: > Where is the boundary between low usage and high usage defined? That's an implementation detail of the eviction algorithm (eviction_v3.cc) https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:211: // that this method does not return the cells matching the provided timestamp, On 2013/11/13 21:14:12, rdsmith wrote: > nit(s): I'd leave out the "keep in mind" since I don't think that information is I meant it to be "as implied by the name" (otherwise it would be GetCurrentCells or something like that) :) dropped. > actually given anywhere else, change "the provided timestamp" to "the iterator's > timestamp", and change "(or before)" to "(or before, depending on the value of changed the wording slightly because I think it is important to keep the distinction between the value when it is called and on return. > the iterator's forward member)". With that, I think this documentation + the > iterator documentation is good; are those changes ok with you?
Responses to comments on the interfaces; I still need to do a detailed review of the implementation. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/14 02:54:19, rvargas wrote: > We could add more methods to Address to make it aware of small tables (or add > more arguments to the new methods), but between the two files I think it is > better to leak from Address to Index than the other way. I could use > Addr::FromEvictedAddress even for small tables after massaging the value, but I > don't think that's what you want. It might be, actually. I agree with you that given that we're going to leak some information in one direction, I'd rather leak from Address to IndexTable than the other way around (ideally I'd like to not leak at all, but I haven't come up with any suggestions as to how to do that :-}). But we still have two different methods on address that look, from where I'm sitting, like they are targeting the two different formats of address in IndexTable. Looking just at the addr.h header, it's not clear why we need both the Addr constructor and the static methods Addr::From*Address(). So if you'd be comfortable either breaking the block portion of address_value out and using the Addr() constructor in place of From*Address (which means a bit more leakage of knowledge into IndexTable, specifically how multi-file address values are split between file # and block #, but we've already gestured at that in our use of the constructor) or the reverse (which leaks the same information) I'd personally prefer that, as it'd make the Addr interface not have any hints of index table formats in it. Alternatively, documentation in addr.h explaining what the From*Address() are useful for without (explicitly or implicitly) being dependent on the IndexTable format would be fine. And, finally, I'm not fanatic about this--if you really want to leave it this way, I'm ok with that. But it feels to me as if we're leaking information both ways currently, and if we can I'd like to clean that up at least a little bit. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:262: return Addr::FromEntryAddress(address_value); On 2013/11/14 02:54:19, rvargas wrote: > On 2013/11/13 21:14:12, rdsmith wrote: > > For both these calls, it feels like "From*Address" is a bit of a misnomer, > both > > because "address" means several different things in this code, and because the > > address_value means different things in the small and large table formats. > > Maybe From*FileIndexAddress? > > The name follows what is available on addr.h :( I can change it if you feel > strongly about it, but the only things Addr knows about is BLOCK_ENTRIES and > BLOCK_EVICTED, it knows nothing about the index. (of course, that is not > completely true, as it knows what it means to grab value and put it into a > meaningless Addr) > > FromEntryFileAddress could work, but I don't like that it's not clear what > "file" is doing there... as in that word sticks with Entry (correct), but also > with Address (wrong) > > Another option would be to not use "address" for the index, but I could not come > up with another term Maybe a comment inside the "if (small_table_)" conditional saying "address_value is just a block index in the small table format, but is a full address (including file number) in the large table format." As I understand it, the misnomer here is that address_value isn't a full address (including file number) in the small table format case. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr address); On 2013/11/14 02:54:19, rvargas wrote: > On 2013/11/13 21:14:12, rdsmith wrote: > > I tried to grok the different usages of CreateEntryCell() and FindEntryCell() > by > > looking at the reference v3 CL, but I couldn't get it. They look different > > enough that my initial thought of "Would a single FindOrCreate interface > work?" > > almost certainly wasn't a good one. But I'm not certain I understand why; > could > > you give me a sense as to how these will usually be used? Specifically, it > > seems like if you're doing a CreateEntryCell(), you've already done the hard > > work if you've figured out the address, so it sounds like CreateEntryCell > should > > only be called from code that actually does allocation within the cache (which > > is the backend??). FindEntryCell you can do when you have the address, but > > where are you likely to have gotten the address from, if not the cache? > > There should be only two callers for CreateEntryCell and that is the backend > creating a new entry (after making sure that the entry doesn't exist), and > creating an evicted entry record, when evicting a normal entry. > > On the other hand, FindEntryCell is used to make sure that something that was > valid some time ago is still valid right now (this is explained with the > declaration of EntryCell). > > Basically, after calling LookupEntries you end up with a bunch of records that > exist right now (hopefully just one), but the caller now has to go open that > file and read it, and that takes time. By the time that operation completes, the > caller should not use the same EntryCell right away, or at least not write to > it, because it may have moved. So it has to grab the hash and address and do a > quick look up to get the current location of the cell, and then it can write to > it. So to reflect back, CreateEntryCell creates a new entry in the bucket chain for the given hash given the address, and FindEntryCell confirms that there already exists an entry in the bucket chain for the given hash that matches the address, and returns it, returning a null EntryCell() if there is no such entry? That makes sense. Would you be willing to add a note about the null EntryCell() return case in the comment in the .h file, and possibly about the normal usage for FindEntryCell() to be to confirm the existence of the given entry at that location after an asynchronous operation? https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. On 2013/11/14 02:54:19, rvargas wrote: > On 2013/11/13 21:14:12, rdsmith wrote: > > Where is the boundary between low usage and high usage defined? > > That's an implementation detail of the eviction algorithm (eviction_v3.cc) So why is it exposed on a public interface? (Not a question for this CL, more of an overall design question.) If it's exposed on a public interface, it seems like it should have some sort of, maybe handwavy, definition for the callers of that interface. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:211: // that this method does not return the cells matching the provided timestamp, On 2013/11/14 02:54:19, rvargas wrote: > On 2013/11/13 21:14:12, rdsmith wrote: > > nit(s): I'd leave out the "keep in mind" since I don't think that information > is > > I meant it to be "as implied by the name" (otherwise it would be GetCurrentCells > or something like that) :) dropped. > > > actually given anywhere else, change "the provided timestamp" to "the > iterator's > > timestamp", and change "(or before)" to "(or before, depending on the value of > > changed the wording slightly because I think it is important to keep the > distinction between the value when it is called and on return. > > > the iterator's forward member)". With that, I think this documentation + the > > iterator documentation is good; are those changes ok with you? > The revised doc looks good to me.
https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/18 20:37:04, rdsmith wrote: > On 2013/11/14 02:54:19, rvargas wrote: > > We could add more methods to Address to make it aware of small tables (or add > > more arguments to the new methods), but between the two files I think it is > > better to leak from Address to Index than the other way. I could use > > Addr::FromEvictedAddress even for small tables after massaging the value, but > I > > don't think that's what you want. > > It might be, actually. I agree with you that given that we're going to leak > some information in one direction, I'd rather leak from Address to IndexTable > than the other way around (ideally I'd like to not leak at all, but I haven't > come up with any suggestions as to how to do that :-}). But we still have two > different methods on address that look, from where I'm sitting, like they are > targeting the two different formats of address in IndexTable. Looking just at > the addr.h header, it's not clear why we need both the Addr constructor and the > static methods Addr::From*Address(). So if you'd be comfortable either breaking > the block portion of address_value out and using the Addr() constructor in place > of From*Address (which means a bit more leakage of knowledge into IndexTable, > specifically how multi-file address values are split between file # and block #, > but we've already gestured at that in our use of the constructor) or the reverse > (which leaks the same information) I'd personally prefer that, as it'd make the > Addr interface not have any hints of index table formats in it. > > Alternatively, documentation in addr.h explaining what the From*Address() are > useful for without (explicitly or implicitly) being dependent on the IndexTable > format would be fine. > > And, finally, I'm not fanatic about this--if you really want to leave it this > way, I'm ok with that. But it feels to me as if we're leaking information both > ways currently, and if we can I'd like to clean that up at least a little bit. Done. I'll cleanup Addr if we are happy here. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr address); On 2013/11/18 20:37:04, rdsmith wrote: > On 2013/11/14 02:54:19, rvargas wrote: > > On 2013/11/13 21:14:12, rdsmith wrote: > > > I tried to grok the different usages of CreateEntryCell() and > FindEntryCell() > > by > > > looking at the reference v3 CL, but I couldn't get it. They look different > > > enough that my initial thought of "Would a single FindOrCreate interface > > work?" > > > almost certainly wasn't a good one. But I'm not certain I understand why; > > could > > > you give me a sense as to how these will usually be used? Specifically, it > > > seems like if you're doing a CreateEntryCell(), you've already done the hard > > > work if you've figured out the address, so it sounds like CreateEntryCell > > should > > > only be called from code that actually does allocation within the cache > (which > > > is the backend??). FindEntryCell you can do when you have the address, but > > > where are you likely to have gotten the address from, if not the cache? > > > > There should be only two callers for CreateEntryCell and that is the backend > > creating a new entry (after making sure that the entry doesn't exist), and > > creating an evicted entry record, when evicting a normal entry. > > > > On the other hand, FindEntryCell is used to make sure that something that was > > valid some time ago is still valid right now (this is explained with the > > declaration of EntryCell). > > > > Basically, after calling LookupEntries you end up with a bunch of records that > > exist right now (hopefully just one), but the caller now has to go open that > > file and read it, and that takes time. By the time that operation completes, > the > > caller should not use the same EntryCell right away, or at least not write to > > it, because it may have moved. So it has to grab the hash and address and do a > > quick look up to get the current location of the cell, and then it can write > to > > it. > > So to reflect back, CreateEntryCell creates a new entry in the bucket chain for > the given hash given the address, and FindEntryCell confirms that there already > exists an entry in the bucket chain for the given hash that matches the address, > and returns it, returning a null EntryCell() if there is no such entry? That > makes sense. Would you be willing to add a note about the null EntryCell() > return case in the comment in the .h file, and possibly about the normal usage > for FindEntryCell() to be to confirm the existence of the given entry at that > location after an asynchronous operation? > I had already modified the comment to reflect use (not sure if you saw that). Added the point about return on failure. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. On 2013/11/18 20:37:04, rdsmith wrote: > On 2013/11/14 02:54:19, rvargas wrote: > > On 2013/11/13 21:14:12, rdsmith wrote: > > > Where is the boundary between low usage and high usage defined? > > > > That's an implementation detail of the eviction algorithm (eviction_v3.cc) > > So why is it exposed on a public interface? (Not a question for this CL, more > of an overall design question.) If it's exposed on a public interface, it > seems like it should have some sort of, maybe handwavy, definition for the > callers of that interface. I'm not sure I follow. It is not defined in an interface, but clearly the caller should be able to get the data out of the table. What is defined is that each entry has a use counter, a fetch counter and a group. (cells don't have fetch counter, but have a group). The caller is able to set the group and counter at will, retrieve it from a particular cell, and get all cells from all groups given a time constrain (this method). The mapping of counters to groups is the detail that is private to the eviction code. The only detail about the arguments is that the method doesn't return evicted entries, in part because I have not written the code to delete evicted entries yet. In that sense, this method provides strictly what the caller needs so it's not really general purpose, but I think that's fine.
Part-way through reviewing Init(), and getting somewhat bogged down in understanding the use of hash values, so pausing to provide these comments and ask for clarifications on hash. I believe that: * The hash for a particular key is 32 bits, generated by base::Hash(key) * The bottom bits (a variable number of bits, depending on table_len/mask_) are used as the address of the relevant bucket for that entry. * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell and compared on cell lookup. That shift means that we're comparing just the number of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is dependent on the number of bits stored for the hash in the different types of table/cells. * The top and bottom bits in this scheme may overlap. * When we increase the table size via doubling (i.e. increase the number of buckets), that may mean that the extra bit (or bits, if we haven't touched a cell from previous doublings) used in indexing into the table may be wrong, and the cell may need to be moved. This is what the MisplacedCell() logic is about. That all correct? I feel as if both address and hash could benefit from a comment somewhere indicating that they mean different things in different parts of the code, with an explanation as to what those differences are (if this is in the design doc, that's fine, and my apologies for forgetting it). https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/18 23:43:34, rvargas wrote: > On 2013/11/18 20:37:04, rdsmith wrote: > > On 2013/11/14 02:54:19, rvargas wrote: > > > We could add more methods to Address to make it aware of small tables (or > add > > > more arguments to the new methods), but between the two files I think it is > > > better to leak from Address to Index than the other way. I could use > > > Addr::FromEvictedAddress even for small tables after massaging the value, > but > > I > > > don't think that's what you want. > > > > It might be, actually. I agree with you that given that we're going to leak > > some information in one direction, I'd rather leak from Address to IndexTable > > than the other way around (ideally I'd like to not leak at all, but I haven't > > come up with any suggestions as to how to do that :-}). But we still have two > > different methods on address that look, from where I'm sitting, like they are > > targeting the two different formats of address in IndexTable. Looking just at > > the addr.h header, it's not clear why we need both the Addr constructor and > the > > static methods Addr::From*Address(). So if you'd be comfortable either > breaking > > the block portion of address_value out and using the Addr() constructor in > place > > of From*Address (which means a bit more leakage of knowledge into IndexTable, > > specifically how multi-file address values are split between file # and block > #, > > but we've already gestured at that in our use of the constructor) or the > reverse > > (which leaks the same information) I'd personally prefer that, as it'd make > the > > Addr interface not have any hints of index table formats in it. > > > > Alternatively, documentation in addr.h explaining what the From*Address() are > > useful for without (explicitly or implicitly) being dependent on the > IndexTable > > format would be fine. > > > > And, finally, I'm not fanatic about this--if you really want to leave it this > > way, I'm ok with that. But it feels to me as if we're leaking information > both > > ways currently, and if we can I'd like to clean that up at least a little bit. > > Done. I'll cleanup Addr if we are happy here. Thanks! Yes, this looks much better. Go ahead and cleanup Addr. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr address); On 2013/11/18 23:43:34, rvargas wrote: > On 2013/11/18 20:37:04, rdsmith wrote: > > On 2013/11/14 02:54:19, rvargas wrote: > > > On 2013/11/13 21:14:12, rdsmith wrote: > > > > I tried to grok the different usages of CreateEntryCell() and > > FindEntryCell() > > > by > > > > looking at the reference v3 CL, but I couldn't get it. They look > different > > > > enough that my initial thought of "Would a single FindOrCreate interface > > > work?" > > > > almost certainly wasn't a good one. But I'm not certain I understand why; > > > could > > > > you give me a sense as to how these will usually be used? Specifically, > it > > > > seems like if you're doing a CreateEntryCell(), you've already done the > hard > > > > work if you've figured out the address, so it sounds like CreateEntryCell > > > should > > > > only be called from code that actually does allocation within the cache > > (which > > > > is the backend??). FindEntryCell you can do when you have the address, > but > > > > where are you likely to have gotten the address from, if not the cache? > > > > > > There should be only two callers for CreateEntryCell and that is the backend > > > creating a new entry (after making sure that the entry doesn't exist), and > > > creating an evicted entry record, when evicting a normal entry. > > > > > > On the other hand, FindEntryCell is used to make sure that something that > was > > > valid some time ago is still valid right now (this is explained with the > > > declaration of EntryCell). > > > > > > Basically, after calling LookupEntries you end up with a bunch of records > that > > > exist right now (hopefully just one), but the caller now has to go open that > > > file and read it, and that takes time. By the time that operation completes, > > the > > > caller should not use the same EntryCell right away, or at least not write > to > > > it, because it may have moved. So it has to grab the hash and address and do > a > > > quick look up to get the current location of the cell, and then it can write > > to > > > it. > > > > So to reflect back, CreateEntryCell creates a new entry in the bucket chain > for > > the given hash given the address, and FindEntryCell confirms that there > already > > exists an entry in the bucket chain for the given hash that matches the > address, > > and returns it, returning a null EntryCell() if there is no such entry? That > > makes sense. Would you be willing to add a note about the null EntryCell() > > return case in the comment in the .h file, and possibly about the normal usage > > for FindEntryCell() to be to confirm the existence of the given entry at that > > location after an asynchronous operation? > > > > I had already modified the comment to reflect use (not sure if you saw that). > Added the point about return on failure. Yeah, sorry, missed the mod. This looks good. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. On 2013/11/18 23:43:34, rvargas wrote: > On 2013/11/18 20:37:04, rdsmith wrote: > > On 2013/11/14 02:54:19, rvargas wrote: > > > On 2013/11/13 21:14:12, rdsmith wrote: > > > > Where is the boundary between low usage and high usage defined? > > > > > > That's an implementation detail of the eviction algorithm (eviction_v3.cc) > > > > So why is it exposed on a public interface? (Not a question for this CL, more > > of an overall design question.) If it's exposed on a public interface, it > > seems like it should have some sort of, maybe handwavy, definition for the > > callers of that interface. > > I'm not sure I follow. It is not defined in an interface, but clearly the caller > should be able to get the data out of the table. What is defined is that each > entry has a use counter, a fetch counter and a group. (cells don't have fetch > counter, but have a group). The caller is able to set the group and counter at > will, retrieve it from a particular cell, and get all cells from all groups > given a time constrain (this method). The mapping of counters to groups is the > detail that is private to the eviction code. Ah, ok, I think I follow. The category that an entry is in can be set externally to that entry, but is limited to values in the EntryGroup enum. The eviction code is the only code that sets those values, and is the only consumer of this interface. So there's a way in which this interface is hiding consumer information from the producer, not the other way around. Just looking at this interface in isolation, I'd value having a comment reference indicating that the iterator into which individual entries are placed depends in the EntryGroup value associated with them. But I could easily imagine that anyone who was doing any extended work in the cache would recognize that *_use referred to EntryGroup, and hence wouldn't need the reference. So I'll raise that as an idea and leave it up to you whether to take it or not. > The only detail about the arguments is that the method doesn't return evicted > entries, in part because I have not written the code to delete evicted entries > yet. In that sense, this method provides strictly what the caller needs so it's > not really general purpose, but I think that's fine. Yeah, I agree. It's also very much targeted at the eviction code--no other caller will really have any use for it. In an ideal world, there'd be a private pipe for that information from IndexTable->eviction code so that other consumers wouldn't be distracted by/have access to the interface, but I think that's way more trouble than it's worth. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0); Does this mean we always have to incrementally grow the extra table in between doublings of the main table? Why put that restriction on it? https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); It feels to me like there are assumptions in this code about aliasing, but I'm not certain what they are. Let me go over what I believe the context for this code is and make comments conditional on that. I believe that this function will be called in only three situations: * First initialization. main_table_ will be null, params->main_table will point to memory mapped from a file, and params->extra_table will be null. * Growth of extra table. params->main_table will be NULL. params->extra_table != extra_table_, but will point to a memory mapped region of the same file, only larger. Thus any changes made to params->extra_table will be reflected in extra_table_ and vice versa (for the length that extra_table_ is mapped for). extra_table_ will be unmapped after this function returns (does the calling code have it's own pointer to extra_table_?) * Doubling of main table. params->main_table != main_table_, but will point to a memory mapped region of the same file, only doubled in size. params->extra_table will point to a section of memory mapped to the extra table file, that hasn't changed in size any. params->extra_table != extra_table_ and the location pointed to by extra_table_ will be unmapped after this routine finishes (??), but the two memory regions will be direct aliases to one another. (I'll probably want a comment clarifying these assumptions somewhere, but I'll wait to suggest what until I'm certain I understand it all. :-}) Assuming that that's correct, why refer to the same data through two different points (extra_table_ and params->extra_table)? It seems like it's just going to generate confusion. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); What results in this 11? This seems like you're simply saying that you can't have a table_len greater than 32 bits, in which case don't you want to just say that? (I.e. IIRTCC the 11 is dependent on a specific value of kBaseTableLen, and that dependency should be explicit in the DCHECK). https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift; In other words, while the number of bits that are thrown away before making the comparison between the hash and the IndexCell partial hash is greater than the number of bits used to index into the table, leave the table small, but when we start using all those thrown-away bits and crowding into the comparison section, switch to the large table? (I'll note that this seems like a different metric for the small/large table distinction that I'd previously understood--I thought it was about whether or not the entries will always be stored in one file. But looking at disk_format_v3.h, it seems like the distinction is specified purely by the number of entries. If there's a subtle background connection between these three things, it should probably be made explicit, or if one of them isn't relevant, that should be made clear too. Based on how much pain it took me to figure out, I'll probably also want an explanatory comment about what hashes mean (in the different contexts) somewhere, but I'll ask for that once I'm confident I actually understand how they're used.) https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:485: // creating the bitmaps, clear the part of the extra table. nit, suggestion: "clear the part of the bitmap referring to the extra table"? https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:502: DCHECK_GE(num_words, old_num_words); Under what circumstances will they be equal? Does that imply that we could be doubling table size in a situation where the extra_table is the same size as the added main table size? https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:264: int extra_bits_; // How many bits are in mask_ above the default value. I could certainly be reading the code incorrectly, but this comment seems incorrect to me. From IndexTable::Init: extra_bits_ = base::bits::Log2Floor(header_->table_len) - base::bits::Log2Floor(kBaseTableLen); DCHECK_GE(extra_bits_, 0); DCHECK_LE(extra_bits_, 11); mask_ = ((kBaseTableLen / kCellsPerBucket) << extra_bits_) - 1; // <<-- small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift; if (!small_table_) extra_bits_ -= kHashShift - kHashSmallTableShift; At the marked line, the relationship in the comment seems accurate to me. But if we don't have a small table, we reduce extra_bits_ by the different in hash shift between small and large tables, which would seem to make the comment invalid. What am I missing?
https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0); On 2013/11/25 19:48:07, rdsmith wrote: > Does this mean we always have to incrementally grow the extra table in between > doublings of the main table? Why put that restriction on it? Nope. It can be zero. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); On 2013/11/25 19:48:07, rdsmith wrote: > It feels to me like there are assumptions in this code about aliasing, but I'm > not certain what they are. Let me go over what I believe the context for this > code is and make comments conditional on that. > > I believe that this function will be called in only three situations: > * First initialization. main_table_ will be null, params->main_table will point > to memory mapped from a file, and params->extra_table will be null. params->extra_table should not be null. > * Growth of extra table. params->main_table will be NULL. params->extra_table > != extra_table_, but will point to a memory mapped region of the same file, only > larger. Thus any changes made to params->extra_table will be reflected in > extra_table_ and vice versa (for the length that extra_table_ is mapped for). > extra_table_ will be unmapped after this function returns (does the calling code > have it's own pointer to extra_table_?) Yes, someone will know how to unmap the old table after this method returns. > * Doubling of main table. params->main_table != main_table_, but will point to > a memory mapped region of the same file, only doubled in size. > params->extra_table will point to a section of memory mapped to the extra table > file, that hasn't changed in size any. params->extra_table != extra_table_ and > the location pointed to by extra_table_ will be unmapped after this routine > finishes (??), but the two memory regions will be direct aliases to one another. sounds correct. The underlying assumption is that the index is memory mapped and follows the format (and files) described in disk_format_v3.h, so the act of re-initializing the IndexTable is a consequence of growing some file and creating another map of the new extension. As such, we should never receive the same pointer that we are currently using (but there's no check for that, because it shouldn't really matter), and we deal with pointers not mapped sections (in fact all the unit tests just allocate memory, never map anything). Aliasing is important in that we don't memcpy data from the old mapping to the new mapping when the file just grows, as that would be a waste of time. So yes, this has a memory interface, but expects the caller to deal with the files as defined by the file format, not just random pointers (hence TestCacheTables::CopyFrom simulating that behavior) But that should be clear from the comment at the top of index_table.h, right? > (I'll probably want a comment clarifying these assumptions somewhere, but I'll > wait to suggest what until I'm certain I understand it all. :-}) > > Assuming that that's correct, why refer to the same data through two different > points (extra_table_ and params->extra_table)? It seems like it's just going to > generate confusion. Do I? I try to refer to the data that logically is being accessed... so for instance line 461 copies the old data somewhere safe and line 463 clears the storage for the new data. Production code could copy the new data and clear the old data, but that would confuse the reader and prevent the unit tests from working :) I mean, the whole thing requires mapping (or a judicious emulation from the caller), but if this code is misusing source or destination somewhere assuming that they are the same, let me know so that it's fixed. So, conceptually, the caller could suspend all requests to this class, copy the file to another file and map the new file (so that there is no aliasing) before calling into this class again asking for re-init. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); On 2013/11/25 19:48:07, rdsmith wrote: > What results in this 11? This seems like you're simply saying that you can't > have a table_len greater than 32 bits, in which case don't you want to just say > that? (I.e. IIRTCC the 11 is dependent on a specific value of kBaseTableLen, > and that dependency should be explicit in the DCHECK). This is just saying that the table can be doubled 11 times from the smaller table to the biggest table. Yes, the value (11) is derived from kBaseTableLen and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to have the number of bits here (right when we are saying "how many extra bits can we have") rather to come out with a probably cryptic formula that derives the proper value from other numbers. The cost is of course less flexibility if the format change, but at least it will fail right when it is needed. When I was writing the code this value changed right here multiple times depending on what I had implemented (and was ready to test) so it seemed better to keep it that way instead of tying it to other constants. I can define a local constant here if you don't like it. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift; On 2013/11/25 19:48:07, rdsmith wrote: > In other words, while the number of bits that are thrown away before making the > comparison between the hash and the IndexCell partial hash is greater than the > number of bits used to index into the table, leave the table small, but when we > start using all those thrown-away bits and crowding into the comparison section, > switch to the large table? I didn't intend to make a strong statement about what this means. It really just means extra_bits_ < 6, as in the table can be doubled 6 times before reaching the 64k border between the two formats. It's just that those two values are tightly related to each other, and the only difference between them is the table type (small or large), and subtracting them gives me the 6 that I need. Somewhat similar to the previous '11' but this time I'm tying it to other constants that change accordingly (and as far as I remember I use this same formula in other places). > > (I'll note that this seems like a different metric for the small/large table > distinction that I'd previously understood--I thought it was about whether or > not the entries will always be stored in one file. But looking at > disk_format_v3.h, it seems like the distinction is specified purely by the > number of entries. If there's a subtle background connection between these > three things, it should probably be made explicit, or if one of them isn't > relevant, that should be made clear too. The distinction between small and large is basically arbitrary. 64K is an interesting number, and it is the base table size of V2, so it was just natural to keep using it. A small table is a new feature of V3, so it covers tables smaller than 64K. The maximum number of blocks on a block file also happens to be 64k mostly because 64k is an interesting number (and it doesn't yield files that are too big with the largest block file), and it requires two standard pages (4k each) to hold the bitmap. That results in being able to store up to 64k entries on a single file (in fact, slightly less than that). And that just fits in that it allows the implied file feature of small tables, but it is not the cause of the border (although it would be hard to move that border somewhere else). So things fit, but the "real" distinction between small and large tables is arbitrarily a 64k table. That's why some numbers are really "count how many times we can do foo", given the arbitrary limits of the format (start at 1k, switch at 64k, up to 22 bits) > > Based on how much pain it took me to figure out, I'll probably also want an > explanatory comment about what hashes mean (in the different contexts) > somewhere, but I'll ask for that once I'm confident I actually understand how > they're used.) https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:502: DCHECK_GE(num_words, old_num_words); On 2013/11/25 19:48:07, rdsmith wrote: > Under what circumstances will they be equal? Does that imply that we could be > doubling table size in a situation where the extra_table is the same size as the > added main table size? Mostly a check for not getting a negative index below. We could be doing that, or maybe growing the extra table by 16 entries (so that the number of words doesn't change). Backend tests grow the table really slowly :) https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:264: int extra_bits_; // How many bits are in mask_ above the default value. On 2013/11/25 19:48:07, rdsmith wrote: > I could certainly be reading the code incorrectly, but this comment seems > incorrect to me. From IndexTable::Init: > > extra_bits_ = base::bits::Log2Floor(header_->table_len) - > base::bits::Log2Floor(kBaseTableLen); > DCHECK_GE(extra_bits_, 0); > DCHECK_LE(extra_bits_, 11); > mask_ = ((kBaseTableLen / kCellsPerBucket) << extra_bits_) - 1; // <<-- > small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift; > if (!small_table_) > extra_bits_ -= kHashShift - kHashSmallTableShift; > > At the marked line, the relationship in the comment seems accurate to me. But > if we don't have a small table, we reduce extra_bits_ by the different in hash > shift between small and large tables, which would seem to make the comment > invalid. What am I missing? nothing :). I could say that "default value" was not defined, and there is a default value for small table and another one for normal tables (which is what happens given that extra_bits is effectively applied against small or normal shifts depending on the type of table)
On 2013/11/25 19:48:07, rdsmith wrote: > Part-way through reviewing Init(), and getting somewhat bogged down in > understanding the use of hash values, so pausing to provide these comments and > ask for clarifications on hash. I believe that: > * The hash for a particular key is 32 bits, generated by base::Hash(key) > * The bottom bits (a variable number of bits, depending on table_len/mask_) are > used as the address of the relevant bucket for that entry. > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell and > compared on cell lookup. That shift means that we're comparing just the number > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is > dependent on the number of bits stored for the hash in the different types of > table/cells. > * The top and bottom bits in this scheme may overlap. > * When we increase the table size via doubling (i.e. increase the number of > buckets), that may mean that the extra bit (or bits, if we haven't touched a > cell from previous doublings) used in indexing into the table may be wrong, and > the cell may need to be moved. This is what the MisplacedCell() logic is about. Roughly yes. As the table grows, bits move from the "we need to store this part" to the "this part is implied given the position in the table". There is no need to store store less and less bits each time (and complicate all the logic) so that means that each time we double the size, there's another bit that becomes redundant. When we double the table, we move all cells that have to moved to the upper half of the table (about half the cells should move). If that goes well, there will never be a misplaced cell. However, if there is a crash while we are moving cells (even though this is a memory-only operation), we could end up with some cells on the right place and some cells in the wrong place... those misplaced cells are detected by a mismatch on the redundant bits, and can be moved as we see them. > > > That all correct? > > I feel as if both address and hash could benefit from a comment somewhere > indicating that they mean different things in different parts of the code, with > an explanation as to what those differences are (if this is in the design doc, > that's fine, and my apologies for forgetting it). How the hash is used to locate entries on the table (and doubling of the table) is covered on the design doc. But I don't fully follow what you're thinking. thanks.
Updated version https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/25 19:48:07, rdsmith wrote: > On 2013/11/18 23:43:34, rvargas wrote: > > On 2013/11/18 20:37:04, rdsmith wrote: > > > On 2013/11/14 02:54:19, rvargas wrote: > > > > We could add more methods to Address to make it aware of small tables (or > > add > > > > more arguments to the new methods), but between the two files I think it > is > > > > better to leak from Address to Index than the other way. I could use > > > > Addr::FromEvictedAddress even for small tables after massaging the value, > > but > > > I > > > > don't think that's what you want. > > > > > > It might be, actually. I agree with you that given that we're going to leak > > > some information in one direction, I'd rather leak from Address to > IndexTable > > > than the other way around (ideally I'd like to not leak at all, but I > haven't > > > come up with any suggestions as to how to do that :-}). But we still have > two > > > different methods on address that look, from where I'm sitting, like they > are > > > targeting the two different formats of address in IndexTable. Looking just > at > > > the addr.h header, it's not clear why we need both the Addr constructor and > > the > > > static methods Addr::From*Address(). So if you'd be comfortable either > > breaking > > > the block portion of address_value out and using the Addr() constructor in > > place > > > of From*Address (which means a bit more leakage of knowledge into > IndexTable, > > > specifically how multi-file address values are split between file # and > block > > #, > > > but we've already gestured at that in our use of the constructor) or the > > reverse > > > (which leaks the same information) I'd personally prefer that, as it'd make > > the > > > Addr interface not have any hints of index table formats in it. > > > > > > Alternatively, documentation in addr.h explaining what the From*Address() > are > > > useful for without (explicitly or implicitly) being dependent on the > > IndexTable > > > format would be fine. > > > > > > And, finally, I'm not fanatic about this--if you really want to leave it > this > > > way, I'm ok with that. But it feels to me as if we're leaking information > > both > > > ways currently, and if we can I'd like to clean that up at least a little > bit. > > > > Done. I'll cleanup Addr if we are happy here. > > Thanks! Yes, this looks much better. Go ahead and cleanup Addr. Done. https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored. On 2013/11/25 19:48:07, rdsmith wrote: > On 2013/11/18 23:43:34, rvargas wrote: > > On 2013/11/18 20:37:04, rdsmith wrote: > > > On 2013/11/14 02:54:19, rvargas wrote: > > > > On 2013/11/13 21:14:12, rdsmith wrote: > > > > > Where is the boundary between low usage and high usage defined? > > > > > > > > That's an implementation detail of the eviction algorithm (eviction_v3.cc) > > > > > > So why is it exposed on a public interface? (Not a question for this CL, > more > > > of an overall design question.) If it's exposed on a public interface, it > > > seems like it should have some sort of, maybe handwavy, definition for the > > > callers of that interface. > > > > I'm not sure I follow. It is not defined in an interface, but clearly the > caller > > should be able to get the data out of the table. What is defined is that each > > entry has a use counter, a fetch counter and a group. (cells don't have fetch > > counter, but have a group). The caller is able to set the group and counter at > > will, retrieve it from a particular cell, and get all cells from all groups > > given a time constrain (this method). The mapping of counters to groups is the > > detail that is private to the eviction code. > > Ah, ok, I think I follow. The category that an entry is in can be set > externally to that entry, but is limited to values in the EntryGroup enum. The > eviction code is the only code that sets those values, and is the only consumer > of this interface. So there's a way in which this interface is hiding consumer > information from the producer, not the other way around. > > Just looking at this interface in isolation, I'd value having a comment > reference indicating that the iterator into which individual entries are placed > depends in the EntryGroup value associated with them. But I could easily > imagine that anyone who was doing any extended work in the cache would recognize > that *_use referred to EntryGroup, and hence wouldn't need the reference. So > I'll raise that as an idea and leave it up to you whether to take it or not. done > > > The only detail about the arguments is that the method doesn't return evicted > > entries, in part because I have not written the code to delete evicted entries > > yet. In that sense, this method provides strictly what the caller needs so > it's > > not really general purpose, but I think that's fine. > > Yeah, I agree. It's also very much targeted at the eviction code--no other > caller will really have any use for it. In an ideal world, there'd be a private > pipe for that information from IndexTable->eviction code so that other consumers > wouldn't be distracted by/have access to the interface, but I think that's way > more trouble than it's worth.
On 2013/11/26 00:48:56, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > Part-way through reviewing Init(), and getting somewhat bogged down in > > understanding the use of hash values, so pausing to provide these comments and > > ask for clarifications on hash. I believe that: > > * The hash for a particular key is 32 bits, generated by base::Hash(key) > > * The bottom bits (a variable number of bits, depending on table_len/mask_) > are > > used as the address of the relevant bucket for that entry. > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell and > > compared on cell lookup. That shift means that we're comparing just the > number > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is > > dependent on the number of bits stored for the hash in the different types of > > table/cells. > > * The top and bottom bits in this scheme may overlap. > > * When we increase the table size via doubling (i.e. increase the number of > > buckets), that may mean that the extra bit (or bits, if we haven't touched a > > cell from previous doublings) used in indexing into the table may be wrong, > and > > the cell may need to be moved. This is what the MisplacedCell() logic is > about. > > Roughly yes. As the table grows, bits move from the "we need to store this part" > to the "this part is implied given the position in the table". There is no need > to store store less and less bits each time (and complicate all the logic) so > that means that each time we double the size, there's another bit that becomes > redundant. > > When we double the table, we move all cells that have to moved to the upper half > of the table (about half the cells should move). If that goes well, there will > never be a misplaced cell. However, if there is a crash while we are moving > cells (even though this is a memory-only operation), we could end up with some > cells on the right place and some cells in the wrong place... those misplaced > cells are detected by a mismatch on the redundant bits, and can be moved as we > see them. That makes sense. Where is the logic that moves the cells on table doubling? Scanning Init(), I only see a call to MoveCells for the old_extra_table cells. Two flags get raised in my brain around crash handling (== unlikely but not impossible corner case) code. The first is that it needs to be tested thoroughly, because our chances of noticing problems in it from more general testing and the field is lower. Quickly skimming the unit test, I'm not sure I see anything that will result in exercising the MisplacedHash/HandleMisplacedHash logic; am I missing it? The second is that code to handle rare corner cases can sometimes result in masking other bugs. It'd be nice if there was some way to disable the misplaced hash logic for the tests that aren't trying to emulate the crashing code. WDYT? (Still working on review; responding to this now just because it's a top-level comment and thus easiest to respond to directly.) > > > > > > > That all correct? > > > > I feel as if both address and hash could benefit from a comment somewhere > > indicating that they mean different things in different parts of the code, > with > > an explanation as to what those differences are (if this is in the design doc, > > that's fine, and my apologies for forgetting it). > > How the hash is used to locate entries on the table (and doubling of the table) > is covered on the design doc. But I don't fully follow what you're thinking. > > thanks.
On 2013/11/26 00:48:56, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > I feel as if both address and hash could benefit from a comment somewhere > > indicating that they mean different things in different parts of the code, > with > > an explanation as to what those differences are (if this is in the design doc, > > that's fine, and my apologies for forgetting it). > > How the hash is used to locate entries on the table (and doubling of the table) > is covered on the design doc. But I don't fully follow what you're thinking. > > thanks. So the design document does cover what the hash is and how it's used to lookup entries (and thank you--it was useful for me to go back to the design doc. I think I get more from it each time I go over it after reading the code, not surprisingly). But my concern is more that the concepts "hash" and "address" mean different things at different places in the code, and I think it's easy for the reader to assume that they mean the same thing, and be confused that way. If they were consistently referred to by different names in the code, that would address my concern. I'm not sure if this should be a high priority--if the concepts are clear, it may be that a code reader will be able to track by context which meaning of hash/address is currently relevant. But that's the concern I was raising. With regard to addresses, I think my confusion is between what I'll call "full addresses" (i.e. what Addr() passes around), and "address values" (what you get from IndexCells and EntryCell::GetAddressValue()). I think it would be clearer if those things used different names consistently. Does this make sense to you? I think that would mean: * Changing the names of {Get,Set}Cell{SmallTable,}Address & {FileNumber,StartBlock}FromAddress, along with the variables used inside those functions? * I think based on its usage in IsValidAddress, kMaxAddress would also become kMaxAddressValue. With regard to hash, similarly there's the portion of the hash that's stored in IndexCell (call it "high hash"), the portion that's implied by the position of the IndexCell within the IndexTable ("low hash"), and the whole thing (gotten from base::Hash() and split up as needed, "full hash"). It looks to me as if what I'm calling the low hash is generally referred to as a bucket hash, so it may be the only real confusion is between high hash and full hash (for example, the EntryCell constructor takes a full hash and passes a high hash to SetCellHash, storing the full hash in hash_). I'm not sure what would be involved in clearing this up or if it's worth it, but I suspect that just using different words for the high hash and the full hash (whatever they are) would reduce confusion in code readers.
On 2013/12/02 18:53:42, rdsmith wrote: > On 2013/11/26 00:48:56, rvargas wrote: > > On 2013/11/25 19:48:07, rdsmith wrote: > > > Part-way through reviewing Init(), and getting somewhat bogged down in > > > understanding the use of hash values, so pausing to provide these comments > and > > > ask for clarifications on hash. I believe that: > > > * The hash for a particular key is 32 bits, generated by base::Hash(key) > > > * The bottom bits (a variable number of bits, depending on table_len/mask_) > > are > > > used as the address of the relevant bucket for that entry. > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell > and > > > compared on cell lookup. That shift means that we're comparing just the > > number > > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is > > > dependent on the number of bits stored for the hash in the different types > of > > > table/cells. > > > * The top and bottom bits in this scheme may overlap. > > > * When we increase the table size via doubling (i.e. increase the number of > > > buckets), that may mean that the extra bit (or bits, if we haven't touched a > > > cell from previous doublings) used in indexing into the table may be wrong, > > and > > > the cell may need to be moved. This is what the MisplacedCell() logic is > > about. > > > > Roughly yes. As the table grows, bits move from the "we need to store this > part" > > to the "this part is implied given the position in the table". There is no > need > > to store store less and less bits each time (and complicate all the logic) so > > that means that each time we double the size, there's another bit that becomes > > redundant. > > > > When we double the table, we move all cells that have to moved to the upper > half > > of the table (about half the cells should move). If that goes well, there will > > never be a misplaced cell. However, if there is a crash while we are moving > > cells (even though this is a memory-only operation), we could end up with some > > cells on the right place and some cells in the wrong place... those misplaced > > cells are detected by a mismatch on the redundant bits, and can be moved as we > > see them. > > That makes sense. Where is the logic that moves the cells on table doubling? > Scanning Init(), I only see a call to MoveCells for the old_extra_table cells. At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes through a loop and calls MoveSingleCell() when needed... note that the function goes through main_table_, even though it needs the caller to provide a pointer to the old_extra_table. > > Two flags get raised in my brain around crash handling (== unlikely but not > impossible corner case) code. The first is that it needs to be tested > thoroughly, because our chances of noticing problems in it from more general > testing and the field is lower. Quickly skimming the unit test, I'm not sure I > see anything that will result in exercising the > MisplacedHash/HandleMisplacedHash logic; am I missing it? > > The second is that code to handle rare corner cases can sometimes result in > masking other bugs. It'd be nice if there was some way to disable the misplaced > hash logic for the tests that aren't trying to emulate the crashing code. > The tests are not going through inconsistent states or checks at this point... some of the checks performed will crash the test if they fail (because most tests run without a backend)... and the tests themselves are not generating bad data (but I must admit that was a manual process... as in they did for a while, and then some step of the test will fail, as not being able to locate something). I'm not too worried at this point with error recovery paths not covered by tests; I have not even written all those paths to start with, and it is more important to bring the whole system up and start running unit tests of the whole thing. > WDYT? > > (Still working on review; responding to this now just because it's a top-level > comment and thus easiest to respond to directly.) > > > > > > > > > > > > That all correct? > > > > > > I feel as if both address and hash could benefit from a comment somewhere > > > indicating that they mean different things in different parts of the code, > > with > > > an explanation as to what those differences are (if this is in the design > doc, > > > that's fine, and my apologies for forgetting it). > > > > How the hash is used to locate entries on the table (and doubling of the > table) > > is covered on the design doc. But I don't fully follow what you're thinking. > > > > thanks.
Next round of comments. Made it all the way through index_table.cc this time! :-} https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0); On 2013/11/26 00:32:41, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > Does this mean we always have to incrementally grow the extra table in between > > doublings of the main table? Why put that restriction on it? > > Nope. It can be zero. Huh. I guess I had a braino; dunno why I thought it was GT. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); On 2013/11/26 00:32:41, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > It feels to me like there are assumptions in this code about aliasing, but I'm > > not certain what they are. Let me go over what I believe the context for > this > > code is and make comments conditional on that. > > > > I believe that this function will be called in only three situations: > > * First initialization. main_table_ will be null, params->main_table will > point > > to memory mapped from a file, and params->extra_table will be null. > > params->extra_table should not be null. So what's the size of the extra_table on init? Just the header? > > > * Growth of extra table. params->main_table will be NULL. > params->extra_table > > != extra_table_, but will point to a memory mapped region of the same file, > only > > larger. Thus any changes made to params->extra_table will be reflected in > > extra_table_ and vice versa (for the length that extra_table_ is mapped for). > > extra_table_ will be unmapped after this function returns (does the calling > code > > have it's own pointer to extra_table_?) > > Yes, someone will know how to unmap the old table after this method returns. > > > * Doubling of main table. params->main_table != main_table_, but will point > to > > a memory mapped region of the same file, only doubled in size. > > params->extra_table will point to a section of memory mapped to the extra > table > > file, that hasn't changed in size any. params->extra_table != extra_table_ > and > > the location pointed to by extra_table_ will be unmapped after this routine > > finishes (??), but the two memory regions will be direct aliases to one > another. > > sounds correct. > > The underlying assumption is that the index is memory mapped and follows the > format (and files) described in disk_format_v3.h, so the act of re-initializing > the IndexTable is a consequence of growing some file and creating another map of > the new extension. As such, we should never receive the same pointer that we are > currently using (but there's no check for that, because it shouldn't really > matter), and we deal with pointers not mapped sections (in fact all the unit > tests just allocate memory, never map anything). > > Aliasing is important in that we don't memcpy data from the old mapping to the > new mapping when the file just grows, as that would be a waste of time. So yes, > this has a memory interface, but expects the caller to deal with the files as > defined by the file format, not just random pointers (hence > TestCacheTables::CopyFrom simulating that behavior) > > But that should be clear from the comment at the top of index_table.h, right? Re-reading that comment, I find it more abstract that I'd like; I'd rather the aliasing assumptions were made explicit, either in the comment to this function or in the comments to IndexTableInitData. But see later. > > (I'll probably want a comment clarifying these assumptions somewhere, but I'll > > wait to suggest what until I'm certain I understand it all. :-}) > > > > Assuming that that's correct, why refer to the same data through two different > > points (extra_table_ and params->extra_table)? It seems like it's just going > to > > generate confusion. > > Do I? I try to refer to the data that logically is being accessed... so for > instance line 461 copies the old data somewhere safe and line 463 clears the > storage for the new data. Production code could copy the new data and clear the > old data, but that would confuse the reader and prevent the unit tests from > working :) > > I mean, the whole thing requires mapping (or a judicious emulation from the > caller), but if this code is misusing source or destination somewhere assuming > that they are the same, let me know so that it's fixed. > > So, conceptually, the caller could suspend all requests to this class, copy the > file to another file and map the new file (so that there is no aliasing) before > calling into this class again asking for re-init. Ah! Thank you. That's a (hypothetical) use case that's very useful for me. Can we put in a comment (before this function or in IndexTableInitData) that if this function is being called a second time, the caller of Init guarantees that the data pointed to by params->main_table will be an improper (i.e. possibly identical) super-set of what was previously passed, and that if main_table isn't being grown, params->extra_table will similarly be a super-set of what was previously passed? I'd also like to have the callers responsibility to deallocate the memory that main_table_ and extra_table_ used to point to made explicit. If one isn't thinking about this function in terms of memory mapped files, it's somewhat surprising; it effectively means that the caller owns that memory, which I don't think is the default assumption one would have looking at the call. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); On 2013/11/26 00:32:41, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > What results in this 11? This seems like you're simply saying that you can't > > have a table_len greater than 32 bits, in which case don't you want to just > say > > that? (I.e. IIRTCC the 11 is dependent on a specific value of kBaseTableLen, > > and that dependency should be explicit in the DCHECK). > > This is just saying that the table can be doubled 11 times from the smaller > table to the biggest table. Yes, the value (11) is derived from kBaseTableLen > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to have > the number of bits here (right when we are saying "how many extra bits can we > have") rather to come out with a probably cryptic formula that derives the > proper value from other numbers. The cost is of course less flexibility if the > format change, but at least it will fail right when it is needed. When I was > writing the code this value changed right here multiple times depending on what > I had implemented (and was ready to test) so it seemed better to keep it that > way instead of tying it to other constants. Huh. I don't particularly care about future flexibility (I mean, future flexibility is good, just not my highest current priority :-}). But from my perspective in reading the code, having the cryptic formulas calls out to the reader what the dependencies between the different constants are. I had been thinking of asking if we could have constants defined in disk_format_v3.h for the different pieces of the IndexCell, and then define the other constants that need to match those based on those constants. It means if you're tracing the "Why is this number this value?" question, you get roadsigns. Can you say more about why having an 11 is clearer? I'm not getting it. It doesn't feel like it tells the code reader/maintainer/debugger anything. > I can define a local constant here if you don't like it. I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what you're offering; I don't find that any more useful than just using 11. But I'd be interested in your thoughts about my general argument above. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift; On 2013/11/26 00:32:41, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > In other words, while the number of bits that are thrown away before making > the > > comparison between the hash and the IndexCell partial hash is greater than the > > number of bits used to index into the table, leave the table small, but when > we > > start using all those thrown-away bits and crowding into the comparison > section, > > switch to the large table? > > I didn't intend to make a strong statement about what this means. It really just > means extra_bits_ < 6, as in the table can be doubled 6 times before reaching > the 64k border between the two formats. > > It's just that those two values are tightly related to each other, and the only > difference between them is the table type (small or large), and subtracting them > gives me the 6 that I need. Somewhat similar to the previous '11' but this time > I'm tying it to other constants that change accordingly (and as far as I > remember I use this same formula in other places). > > > > > > (I'll note that this seems like a different metric for the small/large table > > distinction that I'd previously understood--I thought it was about whether or > > not the entries will always be stored in one file. But looking at > > disk_format_v3.h, it seems like the distinction is specified purely by the > > number of entries. If there's a subtle background connection between these > > three things, it should probably be made explicit, or if one of them isn't > > relevant, that should be made clear too. > > The distinction between small and large is basically arbitrary. 64K is an > interesting number, and it is the base table size of V2, so it was just natural > to keep using it. A small table is a new feature of V3, so it covers tables > smaller than 64K. > > The maximum number of blocks on a block file also happens to be 64k mostly > because 64k is an interesting number (and it doesn't yield files that are too > big with the largest block file), and it requires two standard pages (4k each) > to hold the bitmap. > > That results in being able to store up to 64k entries on a single file (in fact, > slightly less than that). And that just fits in that it allows the implied file > feature of small tables, but it is not the cause of the border (although it > would be hard to move that border somewhere else). > > So things fit, but the "real" distinction between small and large tables is > arbitrarily a 64k table. That's why some numbers are really "count how many > times we can do foo", given the arbitrary limits of the format (start at 1k, > switch at 64k, up to 22 bits) > > > > > Based on how much pain it took me to figure out, I'll probably also want an > > explanatory comment about what hashes mean (in the different contexts) > > somewhere, but I'll ask for that once I'm confident I actually understand how > > they're used.) > This makes sense, but mostly comes down to my wanting to be able to trace dependencies between constants/distinctions. I.e. if the defining difference between a small and a large table is the 64K boundary, I'd love to have (e.g.) a constant somewhere kMaxSmallTableSize = 64K, and have (e.g.) the difference between kHashShift and kHashSmallTable shift defined in terms of it. (I.e. this is the same comment as the previous.) https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:485: // creating the bitmaps, clear the part of the extra table. On 2013/11/25 19:48:07, rdsmith wrote: > nit, suggestion: "clear the part of the bitmap referring to the extra table"? Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:492: int main_table_bit_words = ((mask_ >> 1) + 1) * kCellsPerBucket / 32; This looks to be a factor of two off from the size of the new main table (buckets indexed from 0..mask_, total num buckets = mask_+1, mask_>>1+1 = 1/2 the total number of buckets, multiply by cells per bucket gives you half the total number of cells), by which I infer it refers to the old main table. Maybe change the variable name to old_main_table_bit_words and/or get it from the backup_header_? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:497: DCHECK(growing); Just confirming that this is redundant with if (old_extra_table) because we can't be doubling the table if this is the first time Init() is called. (Not objecting, just trying to understand the failure mode being guarded against.) https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words, main_table_bit_words); Any reason this couldn't be equal? (I.e. if we're doubling without having any extra table in between). https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:515: backup_bitmap_storage_.swap(storage); What failure is being guarded against by copying and swapping instead of editing in place? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); What's the context for these parameters? They're only used on the first call to Init(); are they defined on other calls? And where do they come from on the first call to Init()? What do they contain? It feels a bit weird to have this function responsible for allocating storage on growing, but not on first initialization. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:586: break; nit, suggestion: do { } while(bucket_id); would communicate this loop's purpose to me a bit more clearly. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:600: for (; !found;) { This looks redundant with the "if (found) break;" below; I don't think you need both. (My nit/suggestion above about do/while would also apply here if you liked it above.) https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:669: if (state == ENTRY_FREE) { Nit, suggestion: If this is a switch statement without a default: label, the compiler will automatically warn if there isn't a label for each entry in the enum. I've found this a useful technique for making sure as code evolves that all appropriate state conditionals are tested. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:839: kNumExtraBlocks / 4 : kNumExtraBlocks; What is kNumExtraBlocks? I didn't see any comment at the definition site, and it's only used there and here. Any rational for the /4 or =, or is that arbitrary? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; nit: This seems like a strange place to do this assignment. My assumption, when I saw the call to MoveCells() in Init(), was that all it did was move the cells, but this is a global state change. To me, it would more naturally be paired with the zeroing of extra_table; that's the point at which the extra_table entries become unavailable to the normal lookup (to be made available again in this routine as the cells are put in one at a time). Also, it occurs to me reading this code: Where do we fix up the bucket chains that are left dangling by folding the extra table into the main table on a doubling? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:997: small_table_ = true; Willing to put a comment in here somewhere about why the flipping back and forth of small_table_ over the course of this algorithm?
On 2013/12/02 19:28:01, rdsmith wrote: > On 2013/11/26 00:48:56, rvargas wrote: > > On 2013/11/25 19:48:07, rdsmith wrote: > > > I feel as if both address and hash could benefit from a comment somewhere > > > indicating that they mean different things in different parts of the code, > > with > > > an explanation as to what those differences are (if this is in the design > doc, > > > that's fine, and my apologies for forgetting it). > > > > How the hash is used to locate entries on the table (and doubling of the > table) > > is covered on the design doc. But I don't fully follow what you're thinking. > > > > thanks. > > So the design document does cover what the hash is and how it's used to lookup > entries (and thank you--it was useful for me to go back to the design doc. I > think I get more from it each time I go over it after reading the code, not > surprisingly). But my concern is more that the concepts "hash" and "address" > mean different things at different places in the code, and I think it's easy for > the reader to assume that they mean the same thing, and be confused that way. > If they were consistently referred to by different names in the code, that would > address my concern. I'm not sure if this should be a high priority--if the > concepts are clear, it may be that a code reader will be able to track by > context which meaning of hash/address is currently relevant. But that's the > concern I was raising. > > With regard to addresses, I think my confusion is between what I'll call "full > addresses" (i.e. what Addr() passes around), and "address values" (what you get > from IndexCells and EntryCell::GetAddressValue()). I think it would be clearer > if those things used different names consistently. Does this make sense to you? > I think that would mean: > * Changing the names of {Get,Set}Cell{SmallTable,}Address & > {FileNumber,StartBlock}FromAddress, along with the variables used > inside those functions? > * I think based on its usage in IsValidAddress, kMaxAddress would also > become kMaxAddressValue. Sure, as I said before, this can be improved by a different name but I have no good ideas about what to call it. So here's the whole picture: disk_cache::Addr -> this is the regular address passed around most of the code (I mean, this is visible to most files under disk_cache), and it is the general address of anything (note that the type is not Address) disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is stored inside an Addr, and it is just a number of a fixed length. I mentioned because it has a dedicated type name (and some code uses it, of course), and because it "has" the nickname of "address value" as it is accessible as that from any Addr. And now the new neighbors: The address of a Cell -> currently also known as "address", "defined" on the file format as that, and with a variable bit length. Originally meant to be basically the cell number, or cell index (but index is a very bad name). Should be fairly private to this file (index_table.cc) With the added twist that once you have an EntyCell, given that it is meant to be used from the outside, it must provide a for of address, and so it translates from cell number to Addr (so Address() returns Addr). But on the other hand, it cannot hide the short version from other code _from this file_, as this file is the one who is aware of the details of how many bits go where, so it provides private methods to get to the "real" cell number. As such, I would not want to go to AddressValue everywhere (even though today that's the closest I have to be able to tell that it means a cell number), because it is too close to Addr::value. I would be happier if what is stored inside a Cell is a Fraby and so Address() returns Addr and GetFraby returns Fraby. The caveat of course is that there is a very simple translation from Fraby to CacheAddr. How about Location for anything that relates to the cell number? > > With regard to hash, similarly there's the portion of the hash that's stored in > IndexCell (call it "high hash"), the portion that's implied by the position of > the IndexCell within the IndexTable ("low hash"), and the whole thing (gotten > from base::Hash() and split up as needed, "full hash"). It looks to me as if > what I'm calling the low hash is generally referred to as a bucket hash, so it > may be the only real confusion is between high hash and full hash (for example, > the EntryCell constructor takes a full hash and passes a high hash to > SetCellHash, storing the full hash in hash_). I'm not sure what would be > involved in clearing this up or if it's worth it, but I suspect that just using > different words for the high hash and the full hash (whatever they are) would > reduce confusion in code readers. Ah, now I see what you meant. On the same lines, I could rename the thing that lives inside a cell as "id" as in it is what identifies whatever is stored in that cell. It may be a little weird how the id is derived from part of the hash (versus it _is_ part of the hash), but the cell doesn't really care what is there (not that it cares about any other field, but I digress) uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) { // It is OK for the high order bits of lower_part to overlap with the stored // part of the hash. if (small_table_) return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part; return (GetCellId(cell) << kHashShift) | lower_part; } That would change bucket_hash to something else (and bucket_id, cell_id)... maybe bucket_hash becomes bucket_id.
Thanks again https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); On 2013/12/02 21:48:06, rdsmith wrote: > On 2013/11/26 00:32:41, rvargas wrote: > > On 2013/11/25 19:48:07, rdsmith wrote: > > > It feels to me like there are assumptions in this code about aliasing, but > I'm > > > not certain what they are. Let me go over what I believe the context for > > this > > > code is and make comments conditional on that. > > > > > > I believe that this function will be called in only three situations: > > > * First initialization. main_table_ will be null, params->main_table will > > point > > > to memory mapped from a file, and params->extra_table will be null. > > > > params->extra_table should not be null. > > So what's the size of the extra_table on init? Just the header? There is no header on the extra table, but it has to have some space there to be able to store something. The reason is that no matter how big the main table is, it may take as little as 5 entries to require space on the extra table. Backend unit tests start with something like 8 cells, and production code starts with 1/8th of the main table. > > > > > > * Growth of extra table. params->main_table will be NULL. > > params->extra_table > > > != extra_table_, but will point to a memory mapped region of the same file, > > only > > > larger. Thus any changes made to params->extra_table will be reflected in > > > extra_table_ and vice versa (for the length that extra_table_ is mapped > for). > > > extra_table_ will be unmapped after this function returns (does the calling > > code > > > have it's own pointer to extra_table_?) > > > > Yes, someone will know how to unmap the old table after this method returns. > > > > > * Doubling of main table. params->main_table != main_table_, but will point > > to > > > a memory mapped region of the same file, only doubled in size. > > > params->extra_table will point to a section of memory mapped to the extra > > table > > > file, that hasn't changed in size any. params->extra_table != extra_table_ > > and > > > the location pointed to by extra_table_ will be unmapped after this routine > > > finishes (??), but the two memory regions will be direct aliases to one > > another. > > > > sounds correct. > > > > The underlying assumption is that the index is memory mapped and follows the > > format (and files) described in disk_format_v3.h, so the act of > re-initializing > > the IndexTable is a consequence of growing some file and creating another map > of > > the new extension. As such, we should never receive the same pointer that we > are > > currently using (but there's no check for that, because it shouldn't really > > matter), and we deal with pointers not mapped sections (in fact all the unit > > tests just allocate memory, never map anything). > > > > Aliasing is important in that we don't memcpy data from the old mapping to the > > new mapping when the file just grows, as that would be a waste of time. So > yes, > > this has a memory interface, but expects the caller to deal with the files as > > defined by the file format, not just random pointers (hence > > TestCacheTables::CopyFrom simulating that behavior) > > > > But that should be clear from the comment at the top of index_table.h, right? > > Re-reading that comment, I find it more abstract that I'd like; I'd rather the > aliasing assumptions were made explicit, either in the comment to this function > or in the comments to IndexTableInitData. But see later. I'm a little worried about this comment. The header mentions multiple files that can be remapped while the cache is working and that management is external to this file, so it looks pretty concrete to me. Without the files being mmapped, the whole IndexTable would be completely impractical so that's not exactly a detail. I'll extend the comment for Init() > > > > (I'll probably want a comment clarifying these assumptions somewhere, but > I'll > > > wait to suggest what until I'm certain I understand it all. :-}) > > > > > > Assuming that that's correct, why refer to the same data through two > different > > > points (extra_table_ and params->extra_table)? It seems like it's just > going > > to > > > generate confusion. > > > > Do I? I try to refer to the data that logically is being accessed... so for > > instance line 461 copies the old data somewhere safe and line 463 clears the > > storage for the new data. Production code could copy the new data and clear > the > > old data, but that would confuse the reader and prevent the unit tests from > > working :) > > > > I mean, the whole thing requires mapping (or a judicious emulation from the > > caller), but if this code is misusing source or destination somewhere assuming > > that they are the same, let me know so that it's fixed. > > > > So, conceptually, the caller could suspend all requests to this class, copy > the > > file to another file and map the new file (so that there is no aliasing) > before > > calling into this class again asking for re-init. > > Ah! Thank you. That's a (hypothetical) use case that's very useful for me. > Can we put in a comment (before this function or in IndexTableInitData) that if > this function is being called a second time, the caller of Init guarantees that > the data pointed to by params->main_table will be an improper (i.e. possibly > identical) super-set of what was previously passed, and that if main_table isn't > being grown, params->extra_table will similarly be a super-set of what was > previously passed? > > I'd also like to have the callers responsibility to deallocate the memory that > main_table_ and extra_table_ used to point to made explicit. If one isn't > thinking about this function in terms of memory mapped files, it's somewhat > surprising; it effectively means that the caller owns that memory, which I don't > think is the default assumption one would have looking at the call. > The memory ownership should be clear by the definition of IndexTableInitData. There is ownership transfer for the backups, but not for the bitmap or tables. That's the (relatively new) coding standard so it feels a little weird if that is repeated by comments. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); On 2013/12/02 21:48:06, rdsmith wrote: > On 2013/11/26 00:32:41, rvargas wrote: > > On 2013/11/25 19:48:07, rdsmith wrote: > > > What results in this 11? This seems like you're simply saying that you > can't > > > have a table_len greater than 32 bits, in which case don't you want to just > > say > > > that? (I.e. IIRTCC the 11 is dependent on a specific value of > kBaseTableLen, > > > and that dependency should be explicit in the DCHECK). > > > > This is just saying that the table can be doubled 11 times from the smaller > > table to the biggest table. Yes, the value (11) is derived from kBaseTableLen > > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to > have > > the number of bits here (right when we are saying "how many extra bits can we > > have") rather to come out with a probably cryptic formula that derives the > > proper value from other numbers. The cost is of course less flexibility if the > > format change, but at least it will fail right when it is needed. When I was > > writing the code this value changed right here multiple times depending on > what > > I had implemented (and was ready to test) so it seemed better to keep it that > > way instead of tying it to other constants. > > Huh. I don't particularly care about future flexibility (I mean, future > flexibility is good, just not my highest current priority :-}). But from my > perspective in reading the code, having the cryptic formulas calls out to the > reader what the dependencies between the different constants are. I had been > thinking of asking if we could have constants defined in disk_format_v3.h for > the different pieces of the IndexCell, and then define the other constants that > need to match those based on those constants. It means if you're tracing the > "Why is this number this value?" question, you get roadsigns. > > Can you say more about why having an 11 is clearer? I'm not getting it. It > doesn't feel like it tells the code reader/maintainer/debugger anything. > > > I can define a local constant here if you don't like it. > > I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what > you're offering; I don't find that any more useful than just using 11. But I'd > be interested in your thoughts about my general argument above. > > My best example would be the failure of the formula two lines below this one. I can come up with something like the log of the largest table divided by the smallest table, but then that rises the issue of deriving the largest table. I hear what you are saying and I generally agree, to the point of having a bunch of constants with meaningful names whose values can be derived easily by just inspection (the reason for the constants here). In this particular case, we have a line that calculates a number from the comparison of the current table versus the smallest table. The first check says we cannot have a negative number (a table smaller than the smallest), the second line checks the value against an upper bound, and that should tell right away that we are checking for the largest possible table (right?) The only question is then why the largest table is 11 times larger than the smallest table, or in other words, what is the size of the largest table. Honestly, if I'm reading some code that I didn't write, I would not care much about the value of that number (the actual limit for the test): I would mostly care for some limit being there, either on a check, or in the code that uses that number. In this case, it would mean to me the same if that value is 8 or 15, or derived from a formula. 8 could be wrong, the formula could be off by one and in any case I would not spend time looking at the value unless I'm actually trying to debug a problem and I have reason to suspect that the max value could be at fault. And if I'm doing that, I would probably ignore the formula, figure out what the value should be and compare that against what the code says (number of bits on an address = 22, so 21 of them go to the main table; smaller table = 1024 (10 bits), so delta = 21 - 10) And that's why I prefer a plain number here rather than a formula... given that too much may be read from the formula (as in the check against 6). What I think it matters is not the value, but the check itself (double check an upper limit, whatever that is) I hope I don't sound combative, that's not my intention. I just want to answer the question before we setup in what to do. I'm up for moving relevant constants to the file format header. That's why kBaseTableLen is there. Shifts and masks I'm not so sure as they make more sense with the code that uses them, and I don't want to make them visible to any code outside this file. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:492: int main_table_bit_words = ((mask_ >> 1) + 1) * kCellsPerBucket / 32; On 2013/12/02 21:48:06, rdsmith wrote: > This looks to be a factor of two off from the size of the new main table > (buckets indexed from 0..mask_, total num buckets = mask_+1, mask_>>1+1 = 1/2 > the total number of buckets, multiply by cells per bucket gives you half the > total number of cells), by which I infer it refers to the old main table. Maybe > change the variable name to old_main_table_bit_words and/or get it from the > backup_header_? Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:497: DCHECK(growing); On 2013/12/02 21:48:06, rdsmith wrote: > Just confirming that this is redundant with if (old_extra_table) because we > can't be doubling the table if this is the first time Init() is called. (Not > objecting, just trying to understand the failure mode being guarded against.) correct. Protects against the first part of this function :) https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words, main_table_bit_words); On 2013/12/02 21:48:06, rdsmith wrote: > Any reason this couldn't be equal? (I.e. if we're doubling without having any > extra table in between). An extra table is always required. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:515: backup_bitmap_storage_.swap(storage); On 2013/12/02 21:48:06, rdsmith wrote: > What failure is being guarded against by copying and swapping instead of editing > in place? The bitmap itself is supposed to be growing so we need a bigger backup https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); On 2013/12/02 21:48:06, rdsmith wrote: > What's the context for these parameters? They're only used on the first call to > Init(); are they defined on other calls? And where do they come from on the > first call to Init()? What do they contain? It feels a bit weird to have this > function responsible for allocating storage on growing, but not on first > initialization. There are dedicated files to store the backups. When the cache starts, those files are read and the contents are provided to this method, so that we can check the backup against the mmaped bitmap. When growing, there is no need to read any file because the up to date version is already in memory, owned by this object. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:586: break; On 2013/12/02 21:48:06, rdsmith wrote: > nit, suggestion: do { } while(bucket_id); would communicate this loop's purpose > to me a bit more clearly. Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:600: for (; !found;) { On 2013/12/02 21:48:06, rdsmith wrote: > This looks redundant with the "if (found) break;" below; I don't think you need > both. (My nit/suggestion above about do/while would also apply here if you > liked it above.) Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:669: if (state == ENTRY_FREE) { On 2013/12/02 21:48:06, rdsmith wrote: > Nit, suggestion: If this is a switch statement without a default: label, the > compiler will automatically warn if there isn't a label for each entry in the > enum. I've found this a useful technique for making sure as code evolves that > all appropriate state conditionals are tested. Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:839: kNumExtraBlocks / 4 : kNumExtraBlocks; On 2013/12/02 21:48:06, rdsmith wrote: > What is kNumExtraBlocks? I didn't see any comment at the definition site, and > it's only used there and here. Any rational for the /4 or =, or is that > arbitrary? Completely arbitrary (well, /4 is tied to *2). kNumExtraBlocks is just how many extra blocks should be allocated each time the extra table grows. So this says, for tables larger than 2048, add 256 bloacks each time. Otherwise, add 1024 blocks. So this means the tables total size goes 1024 - 1024+256 - 1024+512 - 1024+768 - 2048 - 3072 - 4096... or something like that (because extra doesn't really start at 0) btw, kNumExtraBlocks also directs how slowly each block file grows. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; On 2013/12/02 21:48:06, rdsmith wrote: > nit: This seems like a strange place to do this assignment. My assumption, when > I saw the call to MoveCells() in Init(), was that all it did was move the cells, > but this is a global state change. To me, it would more naturally be paired > with the zeroing of extra_table; that's the point at which the extra_table > entries become unavailable to the normal lookup (to be made available again in > this routine as the cells are put in one at a time). Sort of. The state of this object is in flux while Init is running, and this method only runs inside Init, so is not like the internal state should be consistent when this method starts. This method receives a pointer to the extra table because the core of Init is switching tables and bitmaps and clearing buffers for this method. But it seems a little silly to clear a member in Init only to pass its value as an explicit argument somewhere else that would be in charge of setting the final value for the member. I mean, the final value of max_bucket is only stable when MoveCells completes so it seemed better to delegate all actions related to that field to this method. (as always, let me know if you feel strongly about doing that on Init) > > Also, it occurs to me reading this code: Where do we fix up the bucket chains > that are left dangling by folding the extra table into the main table on a > doubling? I don't follow the question. We take a chain one at a time and follow it to the end, moving some entries from the main table and all entries from the extra table (to the new extra table) https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:997: small_table_ = true; On 2013/12/02 21:48:06, rdsmith wrote: > Willing to put a comment in here somewhere about why the flipping back and forth > of small_table_ over the course of this algorithm? yeah, not so happy about having to do that. Added a comment.
On 2013/12/02 21:01:49, rvargas wrote: > On 2013/12/02 18:53:42, rdsmith wrote: > > On 2013/11/26 00:48:56, rvargas wrote: > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > Part-way through reviewing Init(), and getting somewhat bogged down in > > > > understanding the use of hash values, so pausing to provide these comments > > and > > > > ask for clarifications on hash. I believe that: > > > > * The hash for a particular key is 32 bits, generated by base::Hash(key) > > > > * The bottom bits (a variable number of bits, depending on > table_len/mask_) > > > are > > > > used as the address of the relevant bucket for that entry. > > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell > > and > > > > compared on cell lookup. That shift means that we're comparing just the > > > number > > > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is > > > > dependent on the number of bits stored for the hash in the different types > > of > > > > table/cells. > > > > * The top and bottom bits in this scheme may overlap. > > > > * When we increase the table size via doubling (i.e. increase the number > of > > > > buckets), that may mean that the extra bit (or bits, if we haven't touched > a > > > > cell from previous doublings) used in indexing into the table may be > wrong, > > > and > > > > the cell may need to be moved. This is what the MisplacedCell() logic is > > > about. > > > > > > Roughly yes. As the table grows, bits move from the "we need to store this > > part" > > > to the "this part is implied given the position in the table". There is no > > need > > > to store store less and less bits each time (and complicate all the logic) > so > > > that means that each time we double the size, there's another bit that > becomes > > > redundant. > > > > > > When we double the table, we move all cells that have to moved to the upper > > half > > > of the table (about half the cells should move). If that goes well, there > will > > > never be a misplaced cell. However, if there is a crash while we are moving > > > cells (even though this is a memory-only operation), we could end up with > some > > > cells on the right place and some cells in the wrong place... those > misplaced > > > cells are detected by a mismatch on the redundant bits, and can be moved as > we > > > see them. > > > > That makes sense. Where is the logic that moves the cells on table doubling? > > Scanning Init(), I only see a call to MoveCells for the old_extra_table cells. > > > At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes > through a loop and calls MoveSingleCell() when needed... note that the function > goes through main_table_, even though it needs the caller to provide a pointer > to the old_extra_table. Right--I had incorrectly thought that MoveCells() was only about the extra table. Thanks for the pointer. > > > > > > Two flags get raised in my brain around crash handling (== unlikely but not > > impossible corner case) code. The first is that it needs to be tested > > thoroughly, because our chances of noticing problems in it from more general > > testing and the field is lower. Quickly skimming the unit test, I'm not sure > I > > see anything that will result in exercising the > > MisplacedHash/HandleMisplacedHash logic; am I missing it? > > > > The second is that code to handle rare corner cases can sometimes result in > > masking other bugs. It'd be nice if there was some way to disable the > misplaced > > hash logic for the tests that aren't trying to emulate the crashing code. > > > > The tests are not going through inconsistent states or checks at this point... > some of the checks performed will crash the test if they fail (because most > tests run without a backend)... and the tests themselves are not generating bad > data (but I must admit that was a manual process... as in they did for a while, > and then some step of the test will fail, as not being able to locate > something). > > I'm not too worried at this point with error recovery paths not covered by > tests; I have not even written all those paths to start with, and it is more > important to bring the whole system up and start running unit tests of the whole > thing. I'm happy to go with postponing corner case testing until later in the process, but I would sorta like to keep to the invariant that code has tests associated with it when it's committed. Does that make sense to you? Any easy way to add tests for the misplaced cell logic? > > > > WDYT? > > > > (Still working on review; responding to this now just because it's a top-level > > comment and thus easiest to respond to directly.) > > > > > > > > > > > > > > > > > That all correct? > > > > > > > > I feel as if both address and hash could benefit from a comment somewhere > > > > indicating that they mean different things in different parts of the code, > > > with > > > > an explanation as to what those differences are (if this is in the design > > doc, > > > > that's fine, and my apologies for forgetting it). > > > > > > How the hash is used to locate entries on the table (and doubling of the > > table) > > > is covered on the design doc. But I don't fully follow what you're thinking. > > > > > > thanks.
Responses to your comments; continuing my review. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > On 2013/11/26 00:32:41, rvargas wrote: > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > It feels to me like there are assumptions in this code about aliasing, but > > I'm > > > > not certain what they are. Let me go over what I believe the context for > > > this > > > > code is and make comments conditional on that. > > > > > > > > I believe that this function will be called in only three situations: > > > > * First initialization. main_table_ will be null, params->main_table will > > > point > > > > to memory mapped from a file, and params->extra_table will be null. > > > > > > params->extra_table should not be null. > > > > So what's the size of the extra_table on init? Just the header? > > There is no header on the extra table, but it has to have some space there to be > able to store something. The reason is that no matter how big the main table is, > it may take as little as 5 entries to require space on the extra table. Backend > unit tests start with something like 8 cells, and production code starts with > 1/8th of the main table. > > > > > > > > > > * Growth of extra table. params->main_table will be NULL. > > > params->extra_table > > > > != extra_table_, but will point to a memory mapped region of the same > file, > > > only > > > > larger. Thus any changes made to params->extra_table will be reflected in > > > > extra_table_ and vice versa (for the length that extra_table_ is mapped > > for). > > > > extra_table_ will be unmapped after this function returns (does the > calling > > > code > > > > have it's own pointer to extra_table_?) > > > > > > Yes, someone will know how to unmap the old table after this method returns. > > > > > > > * Doubling of main table. params->main_table != main_table_, but will > point > > > to > > > > a memory mapped region of the same file, only doubled in size. > > > > params->extra_table will point to a section of memory mapped to the extra > > > table > > > > file, that hasn't changed in size any. params->extra_table != > extra_table_ > > > and > > > > the location pointed to by extra_table_ will be unmapped after this > routine > > > > finishes (??), but the two memory regions will be direct aliases to one > > > another. > > > > > > sounds correct. > > > > > > The underlying assumption is that the index is memory mapped and follows the > > > format (and files) described in disk_format_v3.h, so the act of > > re-initializing > > > the IndexTable is a consequence of growing some file and creating another > map > > of > > > the new extension. As such, we should never receive the same pointer that we > > are > > > currently using (but there's no check for that, because it shouldn't really > > > matter), and we deal with pointers not mapped sections (in fact all the unit > > > tests just allocate memory, never map anything). > > > > > > Aliasing is important in that we don't memcpy data from the old mapping to > the > > > new mapping when the file just grows, as that would be a waste of time. So > > yes, > > > this has a memory interface, but expects the caller to deal with the files > as > > > defined by the file format, not just random pointers (hence > > > TestCacheTables::CopyFrom simulating that behavior) > > > > > > But that should be clear from the comment at the top of index_table.h, > right? > > > > Re-reading that comment, I find it more abstract that I'd like; I'd rather the > > aliasing assumptions were made explicit, either in the comment to this > function > > or in the comments to IndexTableInitData. But see later. > > I'm a little worried about this comment. The header mentions multiple files that > can be remapped while the cache is working and that management is external to > this file, so it looks pretty concrete to me. Without the files being mmapped, > the whole IndexTable would > be completely impractical so that's not exactly a detail. > > I'll extend the comment for Init() I'm good with the IndexTable code explicitly relying on the memory mapping. My preference would be that that memory mapping and the aliasing it implies was called out actually in a comment near the code, but if you'd prefer a pointer to the design doc I'm ok with that. > > > > > > > > (I'll probably want a comment clarifying these assumptions somewhere, but > > I'll > > > > wait to suggest what until I'm certain I understand it all. :-}) > > > > > > > > Assuming that that's correct, why refer to the same data through two > > different > > > > points (extra_table_ and params->extra_table)? It seems like it's just > > going > > > to > > > > generate confusion. > > > > > > Do I? I try to refer to the data that logically is being accessed... so for > > > instance line 461 copies the old data somewhere safe and line 463 clears the > > > storage for the new data. Production code could copy the new data and clear > > the > > > old data, but that would confuse the reader and prevent the unit tests from > > > working :) > > > > > > I mean, the whole thing requires mapping (or a judicious emulation from the > > > caller), but if this code is misusing source or destination somewhere > assuming > > > that they are the same, let me know so that it's fixed. > > > > > > So, conceptually, the caller could suspend all requests to this class, copy > > the > > > file to another file and map the new file (so that there is no aliasing) > > before > > > calling into this class again asking for re-init. > > > > Ah! Thank you. That's a (hypothetical) use case that's very useful for me. > > Can we put in a comment (before this function or in IndexTableInitData) that > if > > this function is being called a second time, the caller of Init guarantees > that > > the data pointed to by params->main_table will be an improper (i.e. possibly > > identical) super-set of what was previously passed, and that if main_table > isn't > > being grown, params->extra_table will similarly be a super-set of what was > > previously passed? > > > > I'd also like to have the callers responsibility to deallocate the memory that > > main_table_ and extra_table_ used to point to made explicit. If one isn't > > thinking about this function in terms of memory mapped files, it's somewhat > > surprising; it effectively means that the caller owns that memory, which I > don't > > think is the default assumption one would have looking at the call. > > > > The memory ownership should be clear by the definition of IndexTableInitData. > There is ownership transfer for the backups, but not for the bitmap or tables. > That's the (relatively new) coding standard so it feels a little weird if that > is repeated by comments. Ok, that makes sense. > https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > On 2013/11/26 00:32:41, rvargas wrote: > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > What results in this 11? This seems like you're simply saying that you > > can't > > > > have a table_len greater than 32 bits, in which case don't you want to > just > > > say > > > > that? (I.e. IIRTCC the 11 is dependent on a specific value of > > kBaseTableLen, > > > > and that dependency should be explicit in the DCHECK). > > > > > > This is just saying that the table can be doubled 11 times from the smaller > > > table to the biggest table. Yes, the value (11) is derived from > kBaseTableLen > > > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to > > have > > > the number of bits here (right when we are saying "how many extra bits can > we > > > have") rather to come out with a probably cryptic formula that derives the > > > proper value from other numbers. The cost is of course less flexibility if > the > > > format change, but at least it will fail right when it is needed. When I was > > > writing the code this value changed right here multiple times depending on > > what > > > I had implemented (and was ready to test) so it seemed better to keep it > that > > > way instead of tying it to other constants. > > > > Huh. I don't particularly care about future flexibility (I mean, future > > flexibility is good, just not my highest current priority :-}). But from my > > perspective in reading the code, having the cryptic formulas calls out to the > > reader what the dependencies between the different constants are. I had been > > thinking of asking if we could have constants defined in disk_format_v3.h for > > the different pieces of the IndexCell, and then define the other constants > that > > need to match those based on those constants. It means if you're tracing the > > "Why is this number this value?" question, you get roadsigns. > > > > Can you say more about why having an 11 is clearer? I'm not getting it. It > > doesn't feel like it tells the code reader/maintainer/debugger anything. > > > > > I can define a local constant here if you don't like it. > > > > I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what > > you're offering; I don't find that any more useful than just using 11. But > I'd > > be interested in your thoughts about my general argument above. > > > > > > My best example would be the failure of the formula two lines below this one. > > I can come up with something like the log of the largest table divided by the > smallest table, but then that rises the issue of deriving the largest table. I > hear what you are saying and I generally agree, to the point of having a bunch > of constants with meaningful names whose values can be derived easily by just > inspection (the reason for the constants here). > > In this particular case, we have a line that calculates a number from the > comparison of the current table versus the smallest table. The first check says > we cannot have a negative number (a table smaller than the smallest), the second > line checks the value against an upper bound, and that should tell right away > that we are checking for the largest possible table (right?) > > The only question is then why the largest table is 11 times larger than the > smallest table, or in other words, what is the size of the largest table. > Honestly, if I'm reading some code that I didn't write, I would not care much > about the value of that number (the actual limit for the test): I would mostly > care for some limit being there, either on a check, or in the code that uses > that number. In this case, it would mean to me the same if that value is 8 or > 15, or derived from a formula. 8 could be wrong, the formula could be off by one > and in any case I would not spend time looking at the value unless I'm actually > trying to debug a problem and I have reason to suspect that the max value could > be at fault. And if I'm doing that, I would probably ignore the formula, figure > out what the value should be and compare that against what the code says (number > of bits on an address = 22, so 21 of them go to the main table; smaller table = > 1024 (10 bits), so delta = 21 - 10) > > And that's why I prefer a plain number here rather than a formula... given that > too much may be read from the formula (as in the check against 6). What I think > it matters is not the value, but the check itself (double check an upper limit, > whatever that is) > > I hope I don't sound combative, that's not my intention. I just want to answer > the question before we setup in what to do. Nope, you don't sound combative, and I appreciate the detailed explanation. > > I'm up for moving relevant constants to the file format header. That's why > kBaseTableLen is there. Shifts and masks I'm not so sure as they make more sense > with the code that uses them, and I don't want to make them visible to any code > outside this file. I think most of my feeling on this topic would be addressed by making explicit the relationships between the different constants--I don't have strong general feelings about either where the constants are declared or this particular location being a number rather than a constant 11. But I would like constants that are used in multiple places in the code and do have a relationship to each other to be expressed in terms of each other. So for instance, disk_format_v3.h would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from those values (probably in index_table.cc). Would you be good with that? Having said all that, the place I diverge from you in modeling this code being read by someone who doesn't know it is, if they suspected the max value would be at fault, the formula would give them a hint as to how that max was derived. That hint might be wrong, but it's structure that would lead them further into the code, and point at the logic that they would need to vet. But as I say, I more care about the relationship between the defined constants being clear than about this particular number being a formula. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words, main_table_bit_words); On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > Any reason this couldn't be equal? (I.e. if we're doubling without having any > > extra table in between). > > An extra table is always required. Reasonable to have a DCHECK that we're getting an extra table, maybe with a quick comment that we always need a little bit of space in the extra table? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > What's the context for these parameters? They're only used on the first call > to > > Init(); are they defined on other calls? And where do they come from on the > > first call to Init()? What do they contain? It feels a bit weird to have > this > > function responsible for allocating storage on growing, but not on first > > initialization. > > There are dedicated files to store the backups. When the cache starts, those > files are read and the contents are provided to this method, so that we can > check the backup against the mmaped bitmap. When growing, there is no need to > read any file because the up to date version is already in memory, owned by this > object. So that means at some point in the future there will be code checking the backup bitmaps against the current bitmaps on first load? Also, what's the allocation behavior for params->backup_header and backup_bitmap on calls to Init() after the first? There's not a code correctness issue, but it's sorta a shame if they keep getting allocated and then deleted without being examined. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > nit: This seems like a strange place to do this assignment. My assumption, > when > > I saw the call to MoveCells() in Init(), was that all it did was move the > cells, > > but this is a global state change. To me, it would more naturally be paired > > with the zeroing of extra_table; that's the point at which the extra_table > > entries become unavailable to the normal lookup (to be made available again in > > this routine as the cells are put in one at a time). > > Sort of. The state of this object is in flux while Init is running, and this > method only runs inside Init, so is not like the internal state should be > consistent when this method starts. Fair, though also tricky to realize for a code reader, since MoveCells() is neither next to Init() nor commented to be only a subroutine of Init(). Explanatory comment? > This method receives a pointer to the extra > table because the core of Init is switching tables and bitmaps and clearing > buffers for this method. But it seems a little silly to clear a member in Init > only to pass its value as an explicit argument somewhere else that would be in > charge of setting the final value for the member. So to be clear, the image in my mind was of setting header()->max_bucket = mask_; in Init() after the call to MoveCells(). I don't see any dependencies on header()->max_bucket in this routine, so I don't think any explicit arguments would be required. It might be worth a comment in Init() that "MoveCells() depends on the pre-doubling values of header()->max_bucket", but what's really true is that MoveCells() needs to be understood in the context of Init(); there isn't a good interface contract between the two routines. > I mean, the final value of > max_bucket is only stable when MoveCells completes so it seemed better to > delegate all actions related to that field to this method. (as always, let me > know if you feel strongly about doing that on Init) I don't care strongly; I think at base what I'm reacting to is that MoveCells() doesn't make sense without knowing the particular point in Init() when it's called (i.e. it's not particularly easy to define an interface contract). But I don't see a way to fix that; I do think pulling it out as a subroutine makes sense just to keep Init() from being a very long function, but it only sorta helps on the abstraction level. > > Also, it occurs to me reading this code: Where do we fix up the bucket chains > > that are left dangling by folding the extra table into the main table on a > > doubling? > > I don't follow the question. We take a chain one at a time and follow it to the > end, moving some entries from the main table and all entries from the extra > table (to the new extra table) I was wondering when the value of bucket->next was updated for the buckets in the main table. It looks to me as if the only two places where the next field is set are in CreateEntryCell() and in GetNextBucket(). Naively, that would seem to imply to me that after moving all the cells from the extra table to the main table in MoveCells(), we've got pointers into invalid parts of the extra table still dangling from the main table buckets. If GetNextBucket() is called on those buckets before any buckets are added to the extra table, that function will fix/zero that pointer. But if we add an extra bucket to the extra table before calling GetNextBucket() on the main table bucket that used to point to that extra table, we'll have an incorrect pointer into that bucket. Am I confused? https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:997: small_table_ = true; On 2013/12/04 01:04:17, rvargas wrote: > On 2013/12/02 21:48:06, rdsmith wrote: > > Willing to put a comment in here somewhere about why the flipping back and > forth > > of small_table_ over the course of this algorithm? > > yeah, not so happy about having to do that. Added a comment. I'm not sure I like this suggestion, but: You could make the small/large table distinction be based on the small_table_ variable in EntryCell, and pass that bit directly to CreateEntryCell. The minus of this, of course, is that it loses an assertion opportunity that EntryCells and the table they are being saved into have the same value. You could (again not sure I like the suggestion, just thinking it through) have small_table_ in IndexTable be a tri-state, with a "transitioning" value, and only do the assertion when you aren't transitioning (or assert as if its a large table on writing, but treat it like a small table on reading, though I'm not 100% sure that would work). Just a thought; up to you what, if anything, you do with it. https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:687: break; This pattern doesn't have much virtue over if conditionals if it has a default branch; the advantage (at least to me) of using a switch statement is that if you don't have a default and the enum changes, the compiler warns you that you have to update the switch statement. (Still completely a suggestion; just sharing with you a coding pattern I've found useful in the past.)
This time there is no code update. I'm still considering what to do with the mismatched entry path. ... and I'll be out next week. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words, main_table_bit_words); On 2013/12/05 19:17:48, rdsmith wrote: > On 2013/12/04 01:04:17, rvargas wrote: > > On 2013/12/02 21:48:06, rdsmith wrote: > > > Any reason this couldn't be equal? (I.e. if we're doubling without having > any > > > extra table in between). > > > > An extra table is always required. > > Reasonable to have a DCHECK that we're getting an extra table, maybe with a > quick comment that we always need a little bit of space in the extra table? Done. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); On 2013/12/05 19:17:48, rdsmith wrote: > On 2013/12/04 01:04:17, rvargas wrote: > > On 2013/12/02 21:48:06, rdsmith wrote: > > > What's the context for these parameters? They're only used on the first > call > > to > > > Init(); are they defined on other calls? And where do they come from on the > > > first call to Init()? What do they contain? It feels a bit weird to have > > this > > > function responsible for allocating storage on growing, but not on first > > > initialization. > > > > There are dedicated files to store the backups. When the cache starts, those > > files are read and the contents are provided to this method, so that we can > > check the backup against the mmaped bitmap. When growing, there is no need to > > read any file because the up to date version is already in memory, owned by > this > > object. > > So that means at some point in the future there will be code checking the backup > bitmaps against the current bitmaps on first load? Checking against the backup is already in place. It is performed on demand by CheckState() > Also, what's the allocation behavior for params->backup_header and backup_bitmap > on calls to Init() after the first? There's not a code correctness issue, but > it's sorta a shame if they keep getting allocated and then deleted without being > examined. The caller doesn't pass them. That would require reading the files, and that doesn't happen for reinit. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; On 2013/12/05 19:17:48, rdsmith wrote: > On 2013/12/04 01:04:17, rvargas wrote: > > On 2013/12/02 21:48:06, rdsmith wrote: > > > nit: This seems like a strange place to do this assignment. My assumption, > > when > > > I saw the call to MoveCells() in Init(), was that all it did was move the > > cells, > > > but this is a global state change. To me, it would more naturally be paired > > > with the zeroing of extra_table; that's the point at which the extra_table > > > entries become unavailable to the normal lookup (to be made available again > in > > > this routine as the cells are put in one at a time). > > > > Sort of. The state of this object is in flux while Init is running, and this > > method only runs inside Init, so is not like the internal state should be > > consistent when this method starts. > > Fair, though also tricky to realize for a code reader, since MoveCells() is > neither next to Init() nor commented to be only a subroutine of Init(). > Explanatory comment? > done > > This method receives a pointer to the extra > > table because the core of Init is switching tables and bitmaps and clearing > > buffers for this method. But it seems a little silly to clear a member in Init > > only to pass its value as an explicit argument somewhere else that would be in > > charge of setting the final value for the member. > > So to be clear, the image in my mind was of setting header()->max_bucket = > mask_; in Init() after the call to MoveCells(). I don't see any dependencies on If I understand correctly, that would fail. The act of writing new cells into the tables (and that happens when old cells are moved), changes max_bucket, as that is just the last used cell from the extra table (or more exactly, the first unused bucket). A write to the main table is fine, but any write to the extra table may move max_bucket forward. > header()->max_bucket in this routine, so I don't think any explicit arguments > would be required. It might be worth a comment in Init() that "MoveCells() > depends on the pre-doubling values of header()->max_bucket", but what's really > true is that MoveCells() needs to be understood in the context of Init(); there > isn't a good interface contract between the two routines. yeah, I added a comment on MoveCells; let me know if that's not enough. > > > > I mean, the final value of > > max_bucket is only stable when MoveCells completes so it seemed better to > > delegate all actions related to that field to this method. (as always, let me > > know if you feel strongly about doing that on Init) > > I don't care strongly; I think at base what I'm reacting to is that MoveCells() > doesn't make sense without knowing the particular point in Init() when it's > called (i.e. it's not particularly easy to define an interface contract). But > I don't see a way to fix that; I do think pulling it out as a subroutine makes > sense just to keep Init() from being a very long function, but it only sorta > helps on the abstraction level. Exactly. It is more a "now we need to sit down and move stuff, but we are not really done with init" > > > > Also, it occurs to me reading this code: Where do we fix up the bucket > chains > > > that are left dangling by folding the extra table into the main table on a > > > doubling? > > > > I don't follow the question. We take a chain one at a time and follow it to > the > > end, moving some entries from the main table and all entries from the extra > > table (to the new extra table) > > I was wondering when the value of bucket->next was updated for the buckets in > the main table. It looks to me as if the only two places where the next field > is set are in CreateEntryCell() and in GetNextBucket(). Naively, that would > seem to imply to me that after moving all the cells from the extra table to the > main table in MoveCells(), we've got pointers into invalid parts of the extra > table still dangling from the main table buckets. If GetNextBucket() is called > on those buckets before any buckets are added to the extra table, that function > will fix/zero that pointer. But if we add an extra bucket to the extra table > before calling GetNextBucket() on the main table bucket that used to point to > that extra table, we'll have an incorrect pointer into that bucket. Am I > confused? That sounds correct (seems like there's a bug there). https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:997: small_table_ = true; On 2013/12/05 19:17:48, rdsmith wrote: > On 2013/12/04 01:04:17, rvargas wrote: > > On 2013/12/02 21:48:06, rdsmith wrote: > > > Willing to put a comment in here somewhere about why the flipping back and > > forth > > > of small_table_ over the course of this algorithm? > > > > yeah, not so happy about having to do that. Added a comment. > > I'm not sure I like this suggestion, but: You could make the small/large table > distinction be based on the small_table_ variable in EntryCell, and pass that > bit directly to CreateEntryCell. The minus of this, of course, is that it loses > an assertion opportunity that EntryCells and the table they are being saved into > have the same value. You could (again not sure I like the suggestion, just > thinking it through) have small_table_ in IndexTable be a tri-state, with a > "transitioning" value, and only do the assertion when you aren't transitioning > (or assert as if its a large table on writing, but treat it like a small table > on reading, though I'm not 100% sure that would work). Just a thought; up to > you what, if anything, you do with it. thx https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:471: // extra_bits_ is really measured against table-size specific values. > So for instance, disk_format_v3.h would have k{SmallTable,}HashOffset > and k{SmallTable,}HashLength, and kHash{SmallTable,}Shift and > kCell{SmallTable,}HashMask would be defined from those values >(probably in index_table.cc). Would you be good with that? They look the same to me (as in there is no length now). Right now there is Offset and Mask; introducing another one to derive one of them from the other doesn't sound right to me, when all values are derived from visual inspection of the bit patterns on the comments of the file format. If you really want to make them visible in file_format.h I'll do that, but I've come to regret that the equivalent masks are visible on addr.h, when there is exactly one compilation unit that needs them. If what you want is point the reader in the right direction, I can comment out a formula here. Note that the "real" formula requires logarithms, so it's not like I can have a real compile time constant that way. https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:687: break; On 2013/12/05 19:17:48, rdsmith wrote: > This pattern doesn't have much virtue over if conditionals if it has a default > branch; the advantage (at least to me) of using a switch statement is that if > you don't have a default and the enum changes, the compiler warns you that you > have to update the switch statement. > > (Still completely a suggestion; just sharing with you a coding pattern I've > found useful in the past.) Yes, I understand your point. But the switch doesn't cover all cases so between extending it to cover all cases with nothing and a default I prefer having a default (in general), even if it "breaks" if the enum changes. On the other hand, if changing one line makes you happy I'm happy.
New version up, ptal. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; On 2013/12/07 02:13:45, rvargas wrote: > On 2013/12/05 19:17:48, rdsmith wrote: > > On 2013/12/04 01:04:17, rvargas wrote: > > > On 2013/12/02 21:48:06, rdsmith wrote: > > > > nit: This seems like a strange place to do this assignment. My > assumption, > > > when > > > > I saw the call to MoveCells() in Init(), was that all it did was move the > > > cells, > > > > but this is a global state change. To me, it would more naturally be > paired > > > > with the zeroing of extra_table; that's the point at which the extra_table > > > > entries become unavailable to the normal lookup (to be made available > again > > in > > > > this routine as the cells are put in one at a time). > > > > > > Sort of. The state of this object is in flux while Init is running, and this > > > method only runs inside Init, so is not like the internal state should be > > > consistent when this method starts. > > > > Fair, though also tricky to realize for a code reader, since MoveCells() is > > neither next to Init() nor commented to be only a subroutine of Init(). > > Explanatory comment? > > > > done > > > > This method receives a pointer to the extra > > > table because the core of Init is switching tables and bitmaps and clearing > > > buffers for this method. But it seems a little silly to clear a member in > Init > > > only to pass its value as an explicit argument somewhere else that would be > in > > > charge of setting the final value for the member. > > > > So to be clear, the image in my mind was of setting header()->max_bucket = > > mask_; in Init() after the call to MoveCells(). I don't see any dependencies > on > > If I understand correctly, that would fail. The act of writing new cells into > the tables (and that happens when old cells are moved), changes max_bucket, as > that is just the last used cell from the extra table (or more exactly, the first > unused bucket). A write to the main table is fine, but any write to the extra > table may move max_bucket forward. > > > header()->max_bucket in this routine, so I don't think any explicit arguments > > would be required. It might be worth a comment in Init() that "MoveCells() > > depends on the pre-doubling values of header()->max_bucket", but what's really > > true is that MoveCells() needs to be understood in the context of Init(); > there > > isn't a good interface contract between the two routines. > > yeah, I added a comment on MoveCells; let me know if that's not enough. > > > > > > > > I mean, the final value of > > > max_bucket is only stable when MoveCells completes so it seemed better to > > > delegate all actions related to that field to this method. (as always, let > me > > > know if you feel strongly about doing that on Init) > > > > I don't care strongly; I think at base what I'm reacting to is that > MoveCells() > > doesn't make sense without knowing the particular point in Init() when it's > > called (i.e. it's not particularly easy to define an interface contract). > But > > I don't see a way to fix that; I do think pulling it out as a subroutine makes > > sense just to keep Init() from being a very long function, but it only sorta > > helps on the abstraction level. > > Exactly. It is more a "now we need to sit down and move stuff, but we are not > really done with init" > > > > > > > Also, it occurs to me reading this code: Where do we fix up the bucket > > chains > > > > that are left dangling by folding the extra table into the main table on a > > > > doubling? > > > > > > I don't follow the question. We take a chain one at a time and follow it to > > the > > > end, moving some entries from the main table and all entries from the extra > > > table (to the new extra table) > > > > I was wondering when the value of bucket->next was updated for the buckets in > > the main table. It looks to me as if the only two places where the next field > > is set are in CreateEntryCell() and in GetNextBucket(). Naively, that would > > seem to imply to me that after moving all the cells from the extra table to > the > > main table in MoveCells(), we've got pointers into invalid parts of the extra > > table still dangling from the main table buckets. If GetNextBucket() is > called > > on those buckets before any buckets are added to the extra table, that > function > > will fix/zero that pointer. But if we add an extra bucket to the extra table > > before calling GetNextBucket() on the main table bucket that used to point to > > that extra table, we'll have an incorrect pointer into that bucket. Am I > > confused? > > That sounds correct (seems like there's a bug there). There was actually no bug because the old value will point to the main table (and not the extra table) so it will be ignored by GetNextBucket. I added a comment and another unit test.
On 2013/12/03 00:41:32, rvargas wrote: > On 2013/12/02 19:28:01, rdsmith wrote: > > On 2013/11/26 00:48:56, rvargas wrote: > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > I feel as if both address and hash could benefit from a comment somewhere > > > > indicating that they mean different things in different parts of the code, > > > with > > > > an explanation as to what those differences are (if this is in the design > > doc, > > > > that's fine, and my apologies for forgetting it). > > > > > > How the hash is used to locate entries on the table (and doubling of the > > table) > > > is covered on the design doc. But I don't fully follow what you're thinking. > > > > > > thanks. > > > > So the design document does cover what the hash is and how it's used to lookup > > entries (and thank you--it was useful for me to go back to the design doc. I > > think I get more from it each time I go over it after reading the code, not > > surprisingly). But my concern is more that the concepts "hash" and "address" > > mean different things at different places in the code, and I think it's easy > for > > the reader to assume that they mean the same thing, and be confused that way. > > If they were consistently referred to by different names in the code, that > would > > address my concern. I'm not sure if this should be a high priority--if the > > concepts are clear, it may be that a code reader will be able to track by > > context which meaning of hash/address is currently relevant. But that's the > > concern I was raising. > > > > With regard to addresses, I think my confusion is between what I'll call "full > > addresses" (i.e. what Addr() passes around), and "address values" (what you > get > > from IndexCells and EntryCell::GetAddressValue()). I think it would be > clearer > > if those things used different names consistently. Does this make sense to > you? > > I think that would mean: > > * Changing the names of {Get,Set}Cell{SmallTable,}Address & > > {FileNumber,StartBlock}FromAddress, along with the variables used > > inside those functions? > > * I think based on its usage in IsValidAddress, kMaxAddress would also > > become kMaxAddressValue. > > Sure, as I said before, this can be improved by a different name but I have no > good ideas about what to call it. > > So here's the whole picture: > disk_cache::Addr -> this is the regular address passed around most of the code > (I mean, this is visible to most files under disk_cache), and it is the general > address of anything (note that the type is not Address) > > disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is > stored inside an Addr, and it is just a number of a fixed length. I mentioned > because it has a dedicated type name (and some code uses it, of course), and > because it "has" the nickname of "address value" as it is accessible as that > from any Addr. > > And now the new neighbors: > The address of a Cell -> currently also known as "address", "defined" on the > file format as that, and with a variable bit length. Originally meant to be > basically the cell number, or cell index (but index is a very bad name). Should > be fairly private to this file (index_table.cc) > > With the added twist that once you have an EntyCell, given that it is meant to > be used from the outside, it must provide a for of address, and so it translates > from cell number to Addr (so Address() returns Addr). But on the other hand, it > cannot hide the short version from other code _from this file_, as this file is > the one who is aware of the details of how many bits go where, so it provides > private methods to get to the "real" cell number. > > As such, I would not want to go to AddressValue everywhere (even though today > that's the closest I have to be able to tell that it means a cell number), > because it is too close to Addr::value. > > I would be happier if what is stored inside a Cell is a Fraby and so Address() > returns Addr and GetFraby returns Fraby. The caveat of course is that there is a > very simple translation from Fraby to CacheAddr. > > How about Location for anything that relates to the cell number? That would be fine, but any reason not to use CellNumber? "Location" sounds more general, and "CellNumber" sounds a bit more like an implementation detail, which I think is the more correct impression. (But "Location" is fine if you prefer it.) > > > > > With regard to hash, similarly there's the portion of the hash that's stored > in > > IndexCell (call it "high hash"), the portion that's implied by the position of > > the IndexCell within the IndexTable ("low hash"), and the whole thing (gotten > > from base::Hash() and split up as needed, "full hash"). It looks to me as if > > what I'm calling the low hash is generally referred to as a bucket hash, so it > > may be the only real confusion is between high hash and full hash (for > example, > > the EntryCell constructor takes a full hash and passes a high hash to > > SetCellHash, storing the full hash in hash_). I'm not sure what would be > > involved in clearing this up or if it's worth it, but I suspect that just > using > > different words for the high hash and the full hash (whatever they are) would > > reduce confusion in code readers. > > Ah, now I see what you meant. > > On the same lines, I could rename the thing that lives inside a cell as "id" as > in it is what identifies whatever is stored in that cell. It may be a little > weird how the id is derived from part of the hash (versus it _is_ part of the > hash), but the cell doesn't really care what is there (not that it cares about > any other field, but I digress) > > uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) { > // It is OK for the high order bits of lower_part to overlap with the stored > // part of the hash. > if (small_table_) > return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part; > > return (GetCellId(cell) << kHashShift) | lower_part; > } > > That would change bucket_hash to something else (and bucket_id, cell_id)... > maybe bucket_hash becomes bucket_id. Sure, that sounds good. I'd like a comment (or a design doc section, which may already exist :-}) that describes the relationship between these concepts, but I do like separating them out. And I like not naming them anything with "hash" in it, as that implies a reasonably close, non-complicated relationship to the actually hash, and the relationship there *is* a bit complicated.
On 2013/12/04 20:59:34, rdsmith wrote: > On 2013/12/02 21:01:49, rvargas wrote: > > On 2013/12/02 18:53:42, rdsmith wrote: > > > On 2013/11/26 00:48:56, rvargas wrote: > > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > > Part-way through reviewing Init(), and getting somewhat bogged down in > > > > > understanding the use of hash values, so pausing to provide these > comments > > > and > > > > > ask for clarifications on hash. I believe that: > > > > > * The hash for a particular key is 32 bits, generated by base::Hash(key) > > > > > * The bottom bits (a variable number of bits, depending on > > table_len/mask_) > > > > are > > > > > used as the address of the relevant bucket for that entry. > > > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the > IndexCell > > > and > > > > > compared on cell lookup. That shift means that we're comparing just the > > > > number > > > > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift > is > > > > > dependent on the number of bits stored for the hash in the different > types > > > of > > > > > table/cells. > > > > > * The top and bottom bits in this scheme may overlap. > > > > > * When we increase the table size via doubling (i.e. increase the number > > of > > > > > buckets), that may mean that the extra bit (or bits, if we haven't > touched > > a > > > > > cell from previous doublings) used in indexing into the table may be > > wrong, > > > > and > > > > > the cell may need to be moved. This is what the MisplacedCell() logic > is > > > > about. > > > > > > > > Roughly yes. As the table grows, bits move from the "we need to store this > > > part" > > > > to the "this part is implied given the position in the table". There is no > > > need > > > > to store store less and less bits each time (and complicate all the logic) > > so > > > > that means that each time we double the size, there's another bit that > > becomes > > > > redundant. > > > > > > > > When we double the table, we move all cells that have to moved to the > upper > > > half > > > > of the table (about half the cells should move). If that goes well, there > > will > > > > never be a misplaced cell. However, if there is a crash while we are > moving > > > > cells (even though this is a memory-only operation), we could end up with > > some > > > > cells on the right place and some cells in the wrong place... those > > misplaced > > > > cells are detected by a mismatch on the redundant bits, and can be moved > as > > we > > > > see them. > > > > > > That makes sense. Where is the logic that moves the cells on table > doubling? > > > Scanning Init(), I only see a call to MoveCells for the old_extra_table > cells. > > > > > > At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes > > through a loop and calls MoveSingleCell() when needed... note that the > function > > goes through main_table_, even though it needs the caller to provide a pointer > > to the old_extra_table. > > Right--I had incorrectly thought that MoveCells() was only about the extra > table. Thanks for the pointer. > > > > > > > > > > > Two flags get raised in my brain around crash handling (== unlikely but not > > > impossible corner case) code. The first is that it needs to be tested > > > thoroughly, because our chances of noticing problems in it from more general > > > testing and the field is lower. Quickly skimming the unit test, I'm not > sure > > I > > > see anything that will result in exercising the > > > MisplacedHash/HandleMisplacedHash logic; am I missing it? > > > > > > The second is that code to handle rare corner cases can sometimes result in > > > masking other bugs. It'd be nice if there was some way to disable the > > misplaced > > > hash logic for the tests that aren't trying to emulate the crashing code. > > > > > > > The tests are not going through inconsistent states or checks at this point... > > some of the checks performed will crash the test if they fail (because most > > tests run without a backend)... and the tests themselves are not generating > bad > > data (but I must admit that was a manual process... as in they did for a > while, > > and then some step of the test will fail, as not being able to locate > > something). > > > > I'm not too worried at this point with error recovery paths not covered by > > tests; I have not even written all those paths to start with, and it is more > > important to bring the whole system up and start running unit tests of the > whole > > thing. > > I'm happy to go with postponing corner case testing until later in the process, > but I would sorta like to keep to the invariant that code has tests associated > with it when it's committed. Does that make sense to you? Any easy way to add > tests for the misplaced cell logic? I didn't see a response to this comment in the comment thread or code; did you have any reaction to it? > > > > > > > > WDYT? > > > > > > (Still working on review; responding to this now just because it's a > top-level > > > comment and thus easiest to respond to directly.) > > > > > > > > > > > > > > > > > > > > > > That all correct? > > > > > > > > > > I feel as if both address and hash could benefit from a comment > somewhere > > > > > indicating that they mean different things in different parts of the > code, > > > > with > > > > > an explanation as to what those differences are (if this is in the > design > > > doc, > > > > > that's fine, and my apologies for forgetting it). > > > > > > > > How the hash is used to locate entries on the table (and doubling of the > > > table) > > > > is covered on the design doc. But I don't fully follow what you're > thinking. > > > > > > > > thanks.
On 2013/12/26 19:39:30, rdsmith wrote: > On 2013/12/04 20:59:34, rdsmith wrote: > > On 2013/12/02 21:01:49, rvargas wrote: > > > On 2013/12/02 18:53:42, rdsmith wrote: > > > > On 2013/11/26 00:48:56, rvargas wrote: > > > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > > > Part-way through reviewing Init(), and getting somewhat bogged down in > > > > > > understanding the use of hash values, so pausing to provide these > > comments > > > > and > > > > > > ask for clarifications on hash. I believe that: > > > > > > * The hash for a particular key is 32 bits, generated by > base::Hash(key) > > > > > > * The bottom bits (a variable number of bits, depending on > > > table_len/mask_) > > > > > are > > > > > > used as the address of the relevant bucket for that entry. > > > > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the > > IndexCell > > > > and > > > > > > compared on cell lookup. That shift means that we're comparing just > the > > > > > number > > > > > > of bits stored in the table for each size; i.e. > kHash{SmallTable,}Shift > > is > > > > > > dependent on the number of bits stored for the hash in the different > > types > > > > of > > > > > > table/cells. > > > > > > * The top and bottom bits in this scheme may overlap. > > > > > > * When we increase the table size via doubling (i.e. increase the > number > > > of > > > > > > buckets), that may mean that the extra bit (or bits, if we haven't > > touched > > > a > > > > > > cell from previous doublings) used in indexing into the table may be > > > wrong, > > > > > and > > > > > > the cell may need to be moved. This is what the MisplacedCell() logic > > is > > > > > about. > > > > > > > > > > Roughly yes. As the table grows, bits move from the "we need to store > this > > > > part" > > > > > to the "this part is implied given the position in the table". There is > no > > > > need > > > > > to store store less and less bits each time (and complicate all the > logic) > > > so > > > > > that means that each time we double the size, there's another bit that > > > becomes > > > > > redundant. > > > > > > > > > > When we double the table, we move all cells that have to moved to the > > upper > > > > half > > > > > of the table (about half the cells should move). If that goes well, > there > > > will > > > > > never be a misplaced cell. However, if there is a crash while we are > > moving > > > > > cells (even though this is a memory-only operation), we could end up > with > > > some > > > > > cells on the right place and some cells in the wrong place... those > > > misplaced > > > > > cells are detected by a mismatch on the redundant bits, and can be moved > > as > > > we > > > > > see them. > > > > > > > > That makes sense. Where is the logic that moves the cells on table > > doubling? > > > > Scanning Init(), I only see a call to MoveCells for the old_extra_table > > cells. > > > > > > > > > At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes > > > through a loop and calls MoveSingleCell() when needed... note that the > > function > > > goes through main_table_, even though it needs the caller to provide a > pointer > > > to the old_extra_table. > > > > Right--I had incorrectly thought that MoveCells() was only about the extra > > table. Thanks for the pointer. > > > > > > > > > > > > > > > > Two flags get raised in my brain around crash handling (== unlikely but > not > > > > impossible corner case) code. The first is that it needs to be tested > > > > thoroughly, because our chances of noticing problems in it from more > general > > > > testing and the field is lower. Quickly skimming the unit test, I'm not > > sure > > > I > > > > see anything that will result in exercising the > > > > MisplacedHash/HandleMisplacedHash logic; am I missing it? > > > > > > > > The second is that code to handle rare corner cases can sometimes result > in > > > > masking other bugs. It'd be nice if there was some way to disable the > > > misplaced > > > > hash logic for the tests that aren't trying to emulate the crashing code. > > > > > > > > > > > The tests are not going through inconsistent states or checks at this > point... > > > some of the checks performed will crash the test if they fail (because most > > > tests run without a backend)... and the tests themselves are not generating > > bad > > > data (but I must admit that was a manual process... as in they did for a > > while, > > > and then some step of the test will fail, as not being able to locate > > > something). > > > > > > I'm not too worried at this point with error recovery paths not covered by > > > tests; I have not even written all those paths to start with, and it is more > > > important to bring the whole system up and start running unit tests of the > > whole > > > thing. > > > > I'm happy to go with postponing corner case testing until later in the > process, > > but I would sorta like to keep to the invariant that code has tests associated > > with it when it's committed. Does that make sense to you? Any easy way to > add > > tests for the misplaced cell logic? > > I didn't see a response to this comment in the comment thread or code; did you > have any reaction to it? Whoops, now I see the NOTREACHED(); in HandleMisplacedCells(). I'm good with that. > > > > > > > > > > > > > WDYT? > > > > > > > > (Still working on review; responding to this now just because it's a > > top-level > > > > comment and thus easiest to respond to directly.) > > > > > > > > > > > > > > > > > > > > > > > > > > > That all correct? > > > > > > > > > > > > I feel as if both address and hash could benefit from a comment > > somewhere > > > > > > indicating that they mean different things in different parts of the > > code, > > > > > with > > > > > > an explanation as to what those differences are (if this is in the > > design > > > > doc, > > > > > > that's fine, and my apologies for forgetting it). > > > > > > > > > > How the hash is used to locate entries on the table (and doubling of the > > > > table) > > > > > is covered on the design doc. But I don't fully follow what you're > > thinking. > > > > > > > > > > thanks.
Next round of comments. One high level question, that doesn't have much if any relationship to this CL: In the design doc, you mention that at some point when you're getting close to overflow on the timestamp you'll need to rebase all timestamps on the table. I don't seen any logic for this operation in your CL 17507006 (the whole v3 disk cache change); is this TBD? It'll obviously (?) need to be done before this code is released into the wild. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); On 2013/12/05 19:17:48, rdsmith wrote: > On 2013/12/04 01:04:17, rvargas wrote: > > On 2013/12/02 21:48:06, rdsmith wrote: > > > On 2013/11/26 00:32:41, rvargas wrote: > > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > > What results in this 11? This seems like you're simply saying that you > > > can't > > > > > have a table_len greater than 32 bits, in which case don't you want to > > just > > > > say > > > > > that? (I.e. IIRTCC the 11 is dependent on a specific value of > > > kBaseTableLen, > > > > > and that dependency should be explicit in the DCHECK). > > > > > > > > This is just saying that the table can be doubled 11 times from the > smaller > > > > table to the biggest table. Yes, the value (11) is derived from > > kBaseTableLen > > > > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to > > > have > > > > the number of bits here (right when we are saying "how many extra bits can > > we > > > > have") rather to come out with a probably cryptic formula that derives the > > > > proper value from other numbers. The cost is of course less flexibility if > > the > > > > format change, but at least it will fail right when it is needed. When I > was > > > > writing the code this value changed right here multiple times depending on > > > what > > > > I had implemented (and was ready to test) so it seemed better to keep it > > that > > > > way instead of tying it to other constants. > > > > > > Huh. I don't particularly care about future flexibility (I mean, future > > > flexibility is good, just not my highest current priority :-}). But from my > > > perspective in reading the code, having the cryptic formulas calls out to > the > > > reader what the dependencies between the different constants are. I had > been > > > thinking of asking if we could have constants defined in disk_format_v3.h > for > > > the different pieces of the IndexCell, and then define the other constants > > that > > > need to match those based on those constants. It means if you're tracing > the > > > "Why is this number this value?" question, you get roadsigns. > > > > > > Can you say more about why having an 11 is clearer? I'm not getting it. It > > > doesn't feel like it tells the code reader/maintainer/debugger anything. > > > > > > > I can define a local constant here if you don't like it. > > > > > > I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what > > > you're offering; I don't find that any more useful than just using 11. But > > I'd > > > be interested in your thoughts about my general argument above. > > > > > > > > > > My best example would be the failure of the formula two lines below this one. > > > > I can come up with something like the log of the largest table divided by the > > smallest table, but then that rises the issue of deriving the largest table. I > > hear what you are saying and I generally agree, to the point of having a bunch > > of constants with meaningful names whose values can be derived easily by just > > inspection (the reason for the constants here). > > > > In this particular case, we have a line that calculates a number from the > > comparison of the current table versus the smallest table. The first check > says > > we cannot have a negative number (a table smaller than the smallest), the > second > > line checks the value against an upper bound, and that should tell right away > > that we are checking for the largest possible table (right?) > > > > The only question is then why the largest table is 11 times larger than the > > smallest table, or in other words, what is the size of the largest table. > > Honestly, if I'm reading some code that I didn't write, I would not care much > > about the value of that number (the actual limit for the test): I would mostly > > care for some limit being there, either on a check, or in the code that uses > > that number. In this case, it would mean to me the same if that value is 8 or > > 15, or derived from a formula. 8 could be wrong, the formula could be off by > one > > and in any case I would not spend time looking at the value unless I'm > actually > > trying to debug a problem and I have reason to suspect that the max value > could > > be at fault. And if I'm doing that, I would probably ignore the formula, > figure > > out what the value should be and compare that against what the code says > (number > > of bits on an address = 22, so 21 of them go to the main table; smaller table > = > > 1024 (10 bits), so delta = 21 - 10) > > > > And that's why I prefer a plain number here rather than a formula... given > that > > too much may be read from the formula (as in the check against 6). What I > think > > it matters is not the value, but the check itself (double check an upper > limit, > > whatever that is) > > > > I hope I don't sound combative, that's not my intention. I just want to answer > > the question before we setup in what to do. > > Nope, you don't sound combative, and I appreciate the detailed explanation. > > > > > I'm up for moving relevant constants to the file format header. That's why > > kBaseTableLen is there. Shifts and masks I'm not so sure as they make more > sense > > with the code that uses them, and I don't want to make them visible to any > code > > outside this file. > > I think most of my feeling on this topic would be addressed by making explicit > the relationships between the different constants--I don't have strong general > feelings about either where the constants are declared or this particular > location being a number rather than a constant 11. But I would like constants > that are used in multiple places in the code and do have a relationship to each > other to be expressed in terms of each other. So for instance, disk_format_v3.h > would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and > kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from > those values (probably in index_table.cc). Would you be good with that? Any thoughts on this request? An extreme example that I don't think I've previously mentioned: kBaseTableLen is defined in three different places (backend_impl.cc, backend_impl_v3.cc, and backend_worker.cc, disk_format_v3.h), and the definition in the .h file looks to contradict the common definition in the .cc files. > > Having said all that, the place I diverge from you in modeling this code being > read by someone who doesn't know it is, if they suspected the max value would be > at fault, the formula would give them a hint as to how that max was derived. > That hint might be wrong, but it's structure that would lead them further into > the code, and point at the logic that they would need to vet. But as I say, I > more care about the relationship between the defined constants being clear than > about this particular number being a formula. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); On 2013/12/07 02:13:45, rvargas wrote: > On 2013/12/05 19:17:48, rdsmith wrote: > > On 2013/12/04 01:04:17, rvargas wrote: > > > On 2013/12/02 21:48:06, rdsmith wrote: > > > > What's the context for these parameters? They're only used on the first > > call > > > to > > > > Init(); are they defined on other calls? And where do they come from on > the > > > > first call to Init()? What do they contain? It feels a bit weird to have > > > this > > > > function responsible for allocating storage on growing, but not on first > > > > initialization. > > > > > > There are dedicated files to store the backups. When the cache starts, those > > > files are read and the contents are provided to this method, so that we can > > > check the backup against the mmaped bitmap. When growing, there is no need > to > > > read any file because the up to date version is already in memory, owned by > > this > > > object. > > > > So that means at some point in the future there will be code checking the > backup > > bitmaps against the current bitmaps on first load? > > Checking against the backup is already in place. It is performed on demand by > CheckState() > > > > Also, what's the allocation behavior for params->backup_header and > backup_bitmap > > on calls to Init() after the first? There's not a code correctness issue, but > > it's sorta a shame if they keep getting allocated and then deleted without > being > > examined. > > The caller doesn't pass them. That would require reading the files, and that > doesn't happen for reinit. Clarifying comment, either on IndexTableInitData or Init()? (Side note: I've been on-and-off tempted reading this code to suggest that it be pulled out into three separate functions for the three separate things that could be happening: initialization, re-init without doubling (increase in the extra table), and re-init with doubling, because as I've read the function I've had to spend some mental energy tracking those three pathways and make sure each is clean. But I haven't been able to convince myself that the code would actually be cleaner and more maintainable that way; it would involve a lot more functions that each branch would call, and the obvious way to implement it would be for Init() to branch when first entered, so I haven't brought it up. I'm just mentioning it now in case the various requests I'm making for comments to help keep those three things straight are onerous enough that the alternative has appeal :-}.) https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:687: break; On 2013/12/07 02:13:45, rvargas wrote: > On 2013/12/05 19:17:48, rdsmith wrote: > > This pattern doesn't have much virtue over if conditionals if it has a default > > branch; the advantage (at least to me) of using a switch statement is that if > > you don't have a default and the enum changes, the compiler warns you that you > > have to update the switch statement. > > > > (Still completely a suggestion; just sharing with you a coding pattern I've > > found useful in the past.) > > Yes, I understand your point. But the switch doesn't cover all cases so between > extending it to cover all cases with nothing and a default I prefer having a > default (in general), even if it "breaks" if the enum changes. On the other > hand, if changing one line makes you happy I'm happy. No worries; just a suggestion. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_... File net/disk_cache/v3/disk_format_v3.h (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_... net/disk_cache/v3/disk_format_v3.h:61: CACHE_EVICTED = 1 << 2 // Already evicted at least one entry. nit, suggestion: It's a bit worrisome to see a non-backward compatible change in a file describing disk format. I poked around in the code, and as I'm sure you know, this is fine because CACHE_EVICT* aren't used anywhere. But is there a reason not to have the discipline of adding new flags at the end of the enum so that that type of thing doesn't need to be verified? (I suppose that strictly speaking you should always bump the version when you modify anything in the format, and I can completely understand not wanting to do that during development, and as long as you're in development you're the only person who'll be inconvenienced by mistakes in this space. Thus "nit, suggestion" :-}. But it does look a bit weird from the outside.) https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:990: // the link. This comment is useful for someone reading through to try and understand initialization, but the place where I think the comment would be most useful is in GetNextBucket(). If this dangling pointer ever does cause problems, it'll be there. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:146: virtual void SaveIndex(net::IOBuffer* buffer, int buffer_len) = 0; nit, suggestion: My scan of the code makes me think that we don't actually need the complexity of the IOBuffer here; it looks like OnBackupTimer() just passes ownership to the backend (the reference is dropped as soon as SaveIndex() is called) and that nothing in the backend uses the refcounting/cancellation semantics of IOBuffer. If that's the case, it would seem simpler to pass a scoped_array<char>? https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:180: // called just to reinitialize the object. Willing to say a bit about what it should be used for? I wasn't completely clear looking at the usage in the total V3 CL what the purpose of this is; is it just a shutdown method? In which case calling it "Shutdown()" might make more sense. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table_unittest.cc (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:574: // Tests doubling of the table. A couple of cases I'm not certain this test covers that I think would be good to make sure are covered: * Doublings that will require moving cells in the table because of the boundary change as to what portion of the hash is used for bucket_id. * Cells in the extra table on doubling, which will need to be successfully copied into the main table. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:615: // Tests bucket chaining when growing the index. Why does using the upper half of the table mean we're exercising bucket chaining? Don't we need more than four entries with the same bucket id to test chaining? https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:706: } Possibly also a test for the logic in CheckState()?
On 2013/12/26 19:37:24, rdsmith wrote: > On 2013/12/03 00:41:32, rvargas wrote: > > On 2013/12/02 19:28:01, rdsmith wrote: > > > On 2013/11/26 00:48:56, rvargas wrote: > > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > > I feel as if both address and hash could benefit from a comment > somewhere > > > > > indicating that they mean different things in different parts of the > code, > > > > with > > > > > an explanation as to what those differences are (if this is in the > design > > > doc, > > > > > that's fine, and my apologies for forgetting it). > > > > > > > > How the hash is used to locate entries on the table (and doubling of the > > > table) > > > > is covered on the design doc. But I don't fully follow what you're > thinking. > > > > > > > > thanks. > > > > > > So the design document does cover what the hash is and how it's used to > lookup > > > entries (and thank you--it was useful for me to go back to the design doc. > I > > > think I get more from it each time I go over it after reading the code, not > > > surprisingly). But my concern is more that the concepts "hash" and > "address" > > > mean different things at different places in the code, and I think it's easy > > for > > > the reader to assume that they mean the same thing, and be confused that > way. > > > If they were consistently referred to by different names in the code, that > > would > > > address my concern. I'm not sure if this should be a high priority--if the > > > concepts are clear, it may be that a code reader will be able to track by > > > context which meaning of hash/address is currently relevant. But that's the > > > concern I was raising. > > > > > > With regard to addresses, I think my confusion is between what I'll call > "full > > > addresses" (i.e. what Addr() passes around), and "address values" (what you > > get > > > from IndexCells and EntryCell::GetAddressValue()). I think it would be > > clearer > > > if those things used different names consistently. Does this make sense to > > you? > > > I think that would mean: > > > * Changing the names of {Get,Set}Cell{SmallTable,}Address & > > > {FileNumber,StartBlock}FromAddress, along with the variables used > > > inside those functions? > > > * I think based on its usage in IsValidAddress, kMaxAddress would also > > > become kMaxAddressValue. > > > > Sure, as I said before, this can be improved by a different name but I have no > > good ideas about what to call it. > > > > So here's the whole picture: > > disk_cache::Addr -> this is the regular address passed around most of the code > > (I mean, this is visible to most files under disk_cache), and it is the > general > > address of anything (note that the type is not Address) > > > > disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is > > stored inside an Addr, and it is just a number of a fixed length. I mentioned > > because it has a dedicated type name (and some code uses it, of course), and > > because it "has" the nickname of "address value" as it is accessible as that > > from any Addr. > > > > And now the new neighbors: > > The address of a Cell -> currently also known as "address", "defined" on the > > file format as that, and with a variable bit length. Originally meant to be > > basically the cell number, or cell index (but index is a very bad name). > Should > > be fairly private to this file (index_table.cc) > > > > With the added twist that once you have an EntyCell, given that it is meant to > > be used from the outside, it must provide a for of address, and so it > translates > > from cell number to Addr (so Address() returns Addr). But on the other hand, > it > > cannot hide the short version from other code _from this file_, as this file > is > > the one who is aware of the details of how many bits go where, so it provides > > private methods to get to the "real" cell number. > > > > As such, I would not want to go to AddressValue everywhere (even though today > > that's the closest I have to be able to tell that it means a cell number), > > because it is too close to Addr::value. > > > > I would be happier if what is stored inside a Cell is a Fraby and so Address() > > returns Addr and GetFraby returns Fraby. The caveat of course is that there is > a > > very simple translation from Fraby to CacheAddr. > > > > How about Location for anything that relates to the cell number? > > That would be fine, but any reason not to use CellNumber? "Location" sounds > more general, and "CellNumber" sounds a bit more like an implementation detail, > which I think is the more correct impression. (But "Location" is fine if you > prefer it.) I meant "Fraby" -> "Location". In other words, the number that is stored inside a cell would become "Location" instead of "Address". CellNumber implies the number of a given cell inside the table (10th cell, part of the 3rd bucket etc). > > > > > > > > > With regard to hash, similarly there's the portion of the hash that's stored > > in > > > IndexCell (call it "high hash"), the portion that's implied by the position > of > > > the IndexCell within the IndexTable ("low hash"), and the whole thing > (gotten > > > from base::Hash() and split up as needed, "full hash"). It looks to me as > if > > > what I'm calling the low hash is generally referred to as a bucket hash, so > it > > > may be the only real confusion is between high hash and full hash (for > > example, > > > the EntryCell constructor takes a full hash and passes a high hash to > > > SetCellHash, storing the full hash in hash_). I'm not sure what would be > > > involved in clearing this up or if it's worth it, but I suspect that just > > using > > > different words for the high hash and the full hash (whatever they are) > would > > > reduce confusion in code readers. > > > > Ah, now I see what you meant. > > > > On the same lines, I could rename the thing that lives inside a cell as "id" > as > > in it is what identifies whatever is stored in that cell. It may be a little > > weird how the id is derived from part of the hash (versus it _is_ part of the > > hash), but the cell doesn't really care what is there (not that it cares about > > any other field, but I digress) > > > > uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) { > > // It is OK for the high order bits of lower_part to overlap with the stored > > // part of the hash. > > if (small_table_) > > return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part; > > > > return (GetCellId(cell) << kHashShift) | lower_part; > > } > > > > That would change bucket_hash to something else (and bucket_id, cell_id)... > > maybe bucket_hash becomes bucket_id. > > Sure, that sounds good. I'd like a comment (or a design doc section, which may > already exist :-}) that describes the relationship between these concepts, but I > do like separating them out. And I like not naming them anything with "hash" in > it, as that implies a reasonably close, non-complicated relationship to the > actually hash, and the relationship there *is* a bit complicated.
On 2013/12/26 21:45:48, rdsmith wrote: > Next round of comments. > > One high level question, that doesn't have much if any relationship to this CL: > In the design doc, you mention that at some point when you're getting close to > overflow on the timestamp you'll need to rebase all timestamps on the table. I > don't seen any logic for this operation in your CL 17507006 (the whole v3 disk > cache change); is this TBD? It'll obviously (?) need to be done before this > code is released into the wild. There is a todo on IndexTable::UpdateTime. The timestamp has a dynamic range of 728 days, and we baseline it to 3 months in the past so that gives us about 21 months to implement and deploy that before it is actually needed.
On 2013/12/26 23:44:15, rvargas wrote: > On 2013/12/26 19:37:24, rdsmith wrote: > > On 2013/12/03 00:41:32, rvargas wrote: > > > On 2013/12/02 19:28:01, rdsmith wrote: > > > > On 2013/11/26 00:48:56, rvargas wrote: > > > > > On 2013/11/25 19:48:07, rdsmith wrote: > > > > > > I feel as if both address and hash could benefit from a comment > > somewhere > > > > > > indicating that they mean different things in different parts of the > > code, > > > > > with > > > > > > an explanation as to what those differences are (if this is in the > > design > > > > doc, > > > > > > that's fine, and my apologies for forgetting it). > > > > > > > > > > How the hash is used to locate entries on the table (and doubling of the > > > > table) > > > > > is covered on the design doc. But I don't fully follow what you're > > thinking. > > > > > > > > > > thanks. > > > > > > > > So the design document does cover what the hash is and how it's used to > > lookup > > > > entries (and thank you--it was useful for me to go back to the design doc. > > > I > > > > think I get more from it each time I go over it after reading the code, > not > > > > surprisingly). But my concern is more that the concepts "hash" and > > "address" > > > > mean different things at different places in the code, and I think it's > easy > > > for > > > > the reader to assume that they mean the same thing, and be confused that > > way. > > > > If they were consistently referred to by different names in the code, that > > > would > > > > address my concern. I'm not sure if this should be a high priority--if > the > > > > concepts are clear, it may be that a code reader will be able to track by > > > > context which meaning of hash/address is currently relevant. But that's > the > > > > concern I was raising. > > > > > > > > With regard to addresses, I think my confusion is between what I'll call > > "full > > > > addresses" (i.e. what Addr() passes around), and "address values" (what > you > > > get > > > > from IndexCells and EntryCell::GetAddressValue()). I think it would be > > > clearer > > > > if those things used different names consistently. Does this make sense > to > > > you? > > > > I think that would mean: > > > > * Changing the names of {Get,Set}Cell{SmallTable,}Address & > > > > {FileNumber,StartBlock}FromAddress, along with the variables used > > > > inside those functions? > > > > * I think based on its usage in IsValidAddress, kMaxAddress would also > > > > become kMaxAddressValue. > > > > > > Sure, as I said before, this can be improved by a different name but I have > no > > > good ideas about what to call it. > > > > > > So here's the whole picture: > > > disk_cache::Addr -> this is the regular address passed around most of the > code > > > (I mean, this is visible to most files under disk_cache), and it is the > > general > > > address of anything (note that the type is not Address) > > > > > > disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is > > > stored inside an Addr, and it is just a number of a fixed length. I > mentioned > > > because it has a dedicated type name (and some code uses it, of course), and > > > because it "has" the nickname of "address value" as it is accessible as that > > > from any Addr. > > > > > > And now the new neighbors: > > > The address of a Cell -> currently also known as "address", "defined" on the > > > file format as that, and with a variable bit length. Originally meant to be > > > basically the cell number, or cell index (but index is a very bad name). > > Should > > > be fairly private to this file (index_table.cc) > > > > > > With the added twist that once you have an EntyCell, given that it is meant > to > > > be used from the outside, it must provide a for of address, and so it > > translates > > > from cell number to Addr (so Address() returns Addr). But on the other hand, > > it > > > cannot hide the short version from other code _from this file_, as this file > > is > > > the one who is aware of the details of how many bits go where, so it > provides > > > private methods to get to the "real" cell number. > > > > > > As such, I would not want to go to AddressValue everywhere (even though > today > > > that's the closest I have to be able to tell that it means a cell number), > > > because it is too close to Addr::value. > > > > > > I would be happier if what is stored inside a Cell is a Fraby and so > Address() > > > returns Addr and GetFraby returns Fraby. The caveat of course is that there > is > > a > > > very simple translation from Fraby to CacheAddr. > > > > > > How about Location for anything that relates to the cell number? > > > > That would be fine, but any reason not to use CellNumber? "Location" sounds > > more general, and "CellNumber" sounds a bit more like an implementation > detail, > > which I think is the more correct impression. (But "Location" is fine if you > > prefer it.) > > > I meant "Fraby" -> "Location". In other words, the number that is stored inside > a cell would become "Location" instead of "Address". > > CellNumber implies the number of a given cell inside the table (10th cell, part > of the 3rd bucket etc). "Location" is fine. > > > > > > > > > > > > > > > > With regard to hash, similarly there's the portion of the hash that's > stored > > > in > > > > IndexCell (call it "high hash"), the portion that's implied by the > position > > of > > > > the IndexCell within the IndexTable ("low hash"), and the whole thing > > (gotten > > > > from base::Hash() and split up as needed, "full hash"). It looks to me as > > if > > > > what I'm calling the low hash is generally referred to as a bucket hash, > so > > it > > > > may be the only real confusion is between high hash and full hash (for > > > example, > > > > the EntryCell constructor takes a full hash and passes a high hash to > > > > SetCellHash, storing the full hash in hash_). I'm not sure what would be > > > > involved in clearing this up or if it's worth it, but I suspect that just > > > using > > > > different words for the high hash and the full hash (whatever they are) > > would > > > > reduce confusion in code readers. > > > > > > Ah, now I see what you meant. > > > > > > On the same lines, I could rename the thing that lives inside a cell as "id" > > as > > > in it is what identifies whatever is stored in that cell. It may be a little > > > weird how the id is derived from part of the hash (versus it _is_ part of > the > > > hash), but the cell doesn't really care what is there (not that it cares > about > > > any other field, but I digress) > > > > > > uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) { > > > // It is OK for the high order bits of lower_part to overlap with the > stored > > > // part of the hash. > > > if (small_table_) > > > return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part; > > > > > > return (GetCellId(cell) << kHashShift) | lower_part; > > > } > > > > > > That would change bucket_hash to something else (and bucket_id, cell_id)... > > > maybe bucket_hash becomes bucket_id. > > > > Sure, that sounds good. I'd like a comment (or a design doc section, which > may > > already exist :-}) that describes the relationship between these concepts, but > I > > do like separating them out. And I like not naming them anything with "hash" > in > > it, as that implies a reasonably close, non-complicated relationship to the > > actually hash, and the relationship there *is* a bit complicated.
On 2013/12/26 23:52:33, rvargas wrote: > On 2013/12/26 21:45:48, rdsmith wrote: > > Next round of comments. > > > > One high level question, that doesn't have much if any relationship to this > CL: > > In the design doc, you mention that at some point when you're getting close to > > overflow on the timestamp you'll need to rebase all timestamps on the table. > I > > don't seen any logic for this operation in your CL 17507006 (the whole v3 disk > > cache change); is this TBD? It'll obviously (?) need to be done before this > > code is released into the wild. > > There is a todo on IndexTable::UpdateTime. SG. > The timestamp has a dynamic range of 728 days, and we baseline it to 3 months in > the past so that gives us about 21 months to implement and deploy that before it > is actually needed. Yeah ... I'd recommend we actually get something out there that will work at least ok before initial deployment. The issue is that while most of the fleet will auto-upgrade on time, we regularly have problems with some percentage of installations not being upgraded. It would be a bit better if those didn't fail hard at the 21 month mark.
renamed version up. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); > > I think most of my feeling on this topic would be addressed by making explicit > > the relationships between the different constants--I don't have strong general > > feelings about either where the constants are declared or this particular > > location being a number rather than a constant 11. But I would like constants > > that are used in multiple places in the code and do have a relationship to > each > > other to be expressed in terms of each other. So for instance, > disk_format_v3.h > > would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and > > kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from > > those values (probably in index_table.cc). Would you be good with that? > > Any thoughts on this request? I tried to move the conversation to the latest version of the code, where I added comments about it. > An extreme example that I don't think I've previously mentioned: kBaseTableLen > is defined in three different places (backend_impl.cc, backend_impl_v3.cc, and > backend_worker.cc, disk_format_v3.h), and the definition in the .h file looks to > contradict the common definition in the .cc files. That is just a transient detail... - backend_impl.cc: nothing to do about it (old format) - backend_impl_v3.cc: old, unused code - backend_worker.cc: either old, unused code, or using the value from disk_format_v3.h - disk_format_v3.h: definition for the new format. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:519: backup_header_.reset(params->backup_header.release()); On 2013/12/26 21:45:49, rdsmith wrote: > On 2013/12/07 02:13:45, rvargas wrote: > > On 2013/12/05 19:17:48, rdsmith wrote: > > > On 2013/12/04 01:04:17, rvargas wrote: > > > > On 2013/12/02 21:48:06, rdsmith wrote: > > > > > What's the context for these parameters? They're only used on the first > > > call > > > > to > > > > > Init(); are they defined on other calls? And where do they come from on > > the > > > > > first call to Init()? What do they contain? It feels a bit weird to > have > > > > this > > > > > function responsible for allocating storage on growing, but not on first > > > > > initialization. > > > > > > > > There are dedicated files to store the backups. When the cache starts, > those > > > > files are read and the contents are provided to this method, so that we > can > > > > check the backup against the mmaped bitmap. When growing, there is no need > > to > > > > read any file because the up to date version is already in memory, owned > by > > > this > > > > object. > > > > > > So that means at some point in the future there will be code checking the > > backup > > > bitmaps against the current bitmaps on first load? > > > > Checking against the backup is already in place. It is performed on demand by > > CheckState() > > > > > > > Also, what's the allocation behavior for params->backup_header and > > backup_bitmap > > > on calls to Init() after the first? There's not a code correctness issue, > but > > > it's sorta a shame if they keep getting allocated and then deleted without > > being > > > examined. > > > > The caller doesn't pass them. That would require reading the files, and that > > doesn't happen for reinit. > > Clarifying comment, either on IndexTableInitData or Init()? > > (Side note: I've been on-and-off tempted reading this code to suggest that it be > pulled out into three separate functions for the three separate things that > could be happening: initialization, re-init without doubling (increase in the > extra table), and re-init with doubling, because as I've read the function I've > had to spend some mental energy tracking those three pathways and make sure each > is clean. But I haven't been able to convince myself that the code would > actually be cleaner and more maintainable that way; it would involve a lot more > functions that each branch would call, and the obvious way to implement it would > be for Init() to branch when first entered, so I haven't brought it up. I'm > just mentioning it now in case the various requests I'm making for comments to > help keep those three things straight are onerous enough that the alternative > has appeal :-}.) Done. https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:471: // extra_bits_ is really measured against table-size specific values. This was the answer to making some constants depend on some others (quoted on the previous comment). Note that revision 11 adds the comments I mention here. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_... File net/disk_cache/v3/disk_format_v3.h (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_... net/disk_cache/v3/disk_format_v3.h:61: CACHE_EVICTED = 1 << 2 // Already evicted at least one entry. On 2013/12/26 21:45:49, rdsmith wrote: > nit, suggestion: It's a bit worrisome to see a non-backward compatible change in > a file describing disk format. I poked around in the code, and as I'm sure you > know, this is fine because CACHE_EVICT* aren't used anywhere. But is there a > reason not to have the discipline of adding new flags at the end of the enum so > that that type of thing doesn't need to be verified? > > (I suppose that strictly speaking you should always bump the version when you > modify anything in the format, and I can completely understand not wanting to do > that during development, and as long as you're in development you're the only > person who'll be inconvenienced by mistakes in this space. Thus "nit, > suggestion" :-}. But it does look a bit weird from the outside.) I intend to make 3.0 the version of the file that is released to users (Beta ideally). Anything before that is not a stable file format, doesn't have users or even code that can be compiled out of TOT, so we should be free to change whatever we'd like to without worrying about compatibility. A flag stating how to read the table should go before one that specifies an eviction algorithm to be used. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:990: // the link. On 2013/12/26 21:45:49, rdsmith wrote: > This comment is useful for someone reading through to try and understand > initialization, but the place where I think the comment would be most useful is > in GetNextBucket(). If this dangling pointer ever does cause problems, it'll be > there. Done. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:146: virtual void SaveIndex(net::IOBuffer* buffer, int buffer_len) = 0; On 2013/12/26 21:45:49, rdsmith wrote: > nit, suggestion: My scan of the code makes me think that we don't actually need > the complexity of the IOBuffer here; it looks like OnBackupTimer() just passes > ownership to the backend (the reference is dropped as soon as SaveIndex() is > called) and that nothing in the backend uses the refcounting/cancellation > semantics of IOBuffer. If that's the case, it would seem simpler to pass a > scoped_array<char>? At the end, the buffer is going to be written to disk with one of the normal write functions (and they all take IOBuffers). The overall premise is always that the initial source of the data should allocate an IOBuffer instead of performing copies or wrappers in the middle. Don't forget that having multiple threads inlolved in IO (and tasks bouncing) implies that there are cancellation semantics involved, even if they are not explicit. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.h:180: // called just to reinitialize the object. On 2013/12/26 21:45:49, rdsmith wrote: > Willing to say a bit about what it should be used for? I wasn't completely > clear looking at the usage in the total V3 CL what the purpose of this is; is it > just a shutdown method? In which case calling it "Shutdown()" might make more > sense. Done. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table_unittest.cc (right): https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:574: // Tests doubling of the table. On 2013/12/26 21:45:49, rdsmith wrote: > A couple of cases I'm not certain this test covers that I think would be good to > make sure are covered: > * Doublings that will require moving cells in the table because of the boundary > change as to what portion of the hash is used for bucket_id. > * Cells in the extra table on doubling, which will need to be successfully > copied into the main table. Both things are covered by this test. https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:615: // Tests bucket chaining when growing the index. On 2013/12/26 21:45:49, rdsmith wrote: > Why does using the upper half of the table mean we're exercising bucket > chaining? Don't we need more than four entries with the same bucket id to test > chaining? Yes, we need more than 4. We have 8 entries in the chain, alternating the bit that would select entries of the upper half after doubling. So we end up with two chains of 4 entries each. The second part of the test writes one entry to each chain, starting with the one on the second half, so the links should be "inverted" at the end (0 -> second extra, upper half -> first extra). If there was a problem with the links when doubling, the original chain would be wrongly chained before receiving the new 5th entry (and could even point to the extra bucket in use by the other chain, or even the bucket with the other four entries). https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index... net/disk_cache/v3/index_table_unittest.cc:706: } On 2013/12/26 21:45:49, rdsmith wrote: > Possibly also a test for the logic in CheckState()? Every lookup goes through that code, so the tests are already covering it. The only part that is not yet covered is the ENTRY_FIXING, but the backend doesn't have that code yet so writing a test that calls FixCell seems pointless so far. BTW, the current tests already verify that FixCell() is never called because they run with a NULL backend.
https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22; How about a comment before this group of constants indicating that these are concretizing the description of IndexCell from disk_format_v3.h and need to be kept in sync with that description? https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF; Suggestion/nit: I'd personally find these easier to understand if a) the Offset and Mask were near each other, and b) the masks were put in the form of |(1 << n) - 1|. That'd make it very easy for me to see "shift by these number of bytes and grab this number of bytes". https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF; Suggestion/nit: As sorta implied by the above comment, I'd value having a kCellLocationOffset = 0, even though it isn't needed. Currently, the information that the cell location's offset is zero is contained in GetCellLocation below; creating a constant is extra verbiage, but collects that information with other information of its ilk. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:42: const int kHashShift = 14; Comment documenting that these are to turn a hash into an ID, and so not reflecting the comments in disk_format_v3.h? https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:210: // that are relinked when growing the table. Would you be willing to say something like "When the table is grown, all buckets are moved from the extra_table to the main table, but the pointers to them from the main table are not updated. Those dangling pointers are recognized here because they point into the main table (i.e. are < min_bucket_num)." I don't care about the wording, but I want to be explicit that this is expected in normal behavior, and is a cleanup from table growth. As an aside, based on the above, I feel like |bucket_num > max_bucket_num| suggests table corruption, whereas |bucket_num < min_bucket_num| is expected cleanup. Worth separating those out? Noting the corruption case? https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:489: // kMaxExtraBitsSmallTable = base::bits::Log2Floor(max 16 bit table) - 10. Thank you for these comments.
Thanks for your time. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22; On 2014/01/06 22:28:35, rdsmith wrote: > How about a comment before this group of constants indicating that these are > concretizing the description of IndexCell from disk_format_v3.h and need to be > kept in sync with that description? Done. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF; I tried intermixing the values, but I found that to be very hard to read. Instead, they are now following the same order of comments of disk_format.h so it should be trivial to check a mismatch. As for constants with 0, I can add them if you feel strongly about them, but the offsets don't specify from where the offset is taken, so it is not like specifying zeroes define the "value" without looking at the code that uses it. I'm slightly against adding dead code. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:42: const int kHashShift = 14; On 2014/01/06 22:28:35, rdsmith wrote: > Comment documenting that these are to turn a hash into an ID, and so not > reflecting the comments in disk_format_v3.h? Done. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index... net/disk_cache/v3/index_table.cc:210: // that are relinked when growing the table. On 2014/01/06 22:28:35, rdsmith wrote: > Would you be willing to say something like "When the table is grown, all buckets > are moved from the extra_table to the main table, but the pointers to them from > the main table are not updated. Those dangling pointers are recognized here > because they point into the main table (i.e. are < min_bucket_num)." I don't > care about the wording, but I want to be explicit that this is expected in > normal behavior, and is a cleanup from table growth. > > As an aside, based on the above, I feel like |bucket_num > max_bucket_num| > suggests table corruption, whereas |bucket_num < min_bucket_num| is expected > cleanup. Worth separating those out? Noting the corruption case? Extended the comment. I'm not sure what would be the advantage of separating the conditions, given that the action is the same. "corruption" is a little loose term given that a big part of v3 is that there is no corruption... as in the code will just fix whatever needs to be fixed and ignores whatever cannot be fixed.
LGTM. Sorry it's taken so long :-{.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/rvargas@chromium.org/53313004/2470001
Message was sent while issue was closed.
Change committed as 244027 |