Issue 53313004: Disk cache v3: The main index table.

rvargas (doing something else)

7 years, 1 month ago (2013-10-30 22:26:41 UTC) #1

Randy Smith (Not in Mondays)

Ricardo: I'm getting an error accessing this via the "view" links; when I click through ...

7 years, 1 month ago (2013-10-31 17:53:20 UTC) #2

rvargas (doing something else)

On 2013/10/31 17:53:20, rdsmith wrote: > Ricardo: I'm getting an error accessing this via the ...

7 years, 1 month ago (2013-10-31 19:06:06 UTC) #3

Randy Smith (Not in Mondays)

Just a quick heads up that I'm going to be leaving soon, and won't be ...

7 years, 1 month ago (2013-10-31 21:01:55 UTC) #4

Randy Smith (Not in Mondays)

Quick question from doing an initial scan through: It's hard to evaluate an interface without ...

7 years, 1 month ago (2013-10-31 21:24:09 UTC) #5

rvargas (doing something else)

On 2013/10/31 21:24:09, rdsmith wrote: > Quick question from doing an initial scan through: It's ...

7 years, 1 month ago (2013-10-31 22:18:57 UTC) #6

On 2013/10/31 21:24:09, rdsmith wrote:
> Quick question from doing an initial scan through: It's hard to evaluate an
> interface without seeing how that interface is going to be used in real code. 
> Is https://codereview.chromium.org/15203004 (diffed against version of trunk
> before this started landing) or https://codereview.chromium.org/17507006 (diff
> against current trunk) still authoritative if I want to look at that?

15203004 is pretty close... it lacks some cleanup changes I made for this
review, but I'll update it to match this code (I'm assuming that you just want
to see the code, not to patch it locally).

The use is basically:
BackendImplV3 owns IndexTable, and initialization is made by:

int BackendImplV3::Init(const CompletionCallback& callback) {
  scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_INIT);
  PostWorkItem(work_item);  ---> post a task to the cache thread (BackendWorker)
to open and map the files.
}

void BackendImplV3::OnInitComplete(WorkItem* work_item) {
  // called when the BackendWorker completes initialization... so that
IndexTableInitData is available on the IO thread
  index_.Init(&result.get()->index_data);
}

When the index needs to grow:
void BackendImplV3::GrowIndex() {
  scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_GROW_INDEX);
  PostWorkItem(work_item);   ----> same thing, ask the worker to extend the
files and map them again
}

void BackendImplV3::OnGrowIndexComplete(WorkItem* work_item) {
  index_.Init(&result.get()->index_data);   ----> replace the old mappings with
the new one
  work_item->set_flags(WorkItem::WORK_COMPLETE);
  PostWorkItem(work_item);   ----> post a task to release the old mappings
}

To create an entry:
int BackendImplV3::CreateEntry(const std::string& key, Entry** entry,
                               const CompletionCallback& callback) {
  EntrySet entries = index_.LookupEntry(hash);
  + post a task to open the entry if a match is found
}

int BackendImplV3::OnCreateEntryComplete(const std::string& key, uint32 hash,
                                         ShortEntryRecord* short_record,
                                         Entry** entry,
                                         const CompletionCallback& callback) {
  EntryCell cell = index_.CreateEntryCell(hash, entry_address);
  if (!cell.IsValid()) {
  }

  + use the cell.
}

enumerations:
(around BackendImplV3::OpenNext() )

   int lower_limit = index_.CalculateTimestamp(work_item->initial_time());
    if (iterator->timestamp <= lower_limit ||
        !index_.GetNextCells(iterator)) {
      return false;
    }

...
   while (!cells->empty()) {
      EntryCell last_cell = index_.FindEntryCell(cells->back().hash,
                                                 cells->back().address);
      cells->pop_back();
      if (!last_cell.IsValid())
        continue;

Basically the IndexTable returns a group of entries that match a given timestamp
(that has a granularity of 1 minute)... keeps the list around and inspect each
element to iterate or do evictions. The process implies sending tasks to the
worker thread and when that completes verifying that the cell is ok, by calling
FindEntryCell

The other user of the IndexTable API is eviction_v3, which grabs the table from
the backend and uses it to select entries to evict... but then it just tells the
backend to evict a given entry. So basically, all of EvictionV3 + BackendImplV3
+ this CL run on the IO thread, and all thread hoping is done by the backend.

Let me know if you are interested in the thread logic (BackendWorker and
BackendWorkItem)... I assume it's not that relevant at this point. (that is to
say, I'm happy to answer questions, the reference CL is not commented so it may
be hard to follow).

rvargas (doing something else)

> Basically the IndexTable returns a group of entries that match a given timestamp > ...

7 years, 1 month ago (2013-10-31 22:22:09 UTC) #7

Randy Smith (Not in Mondays)

On 2013/10/31 22:18:57, rvargas wrote: > On 2013/10/31 21:24:09, rdsmith wrote: > > Quick question ...

7 years, 1 month ago (2013-11-01 21:51:36 UTC) #8

On 2013/10/31 22:18:57, rvargas wrote:
> On 2013/10/31 21:24:09, rdsmith wrote:
> > Quick question from doing an initial scan through: It's hard to evaluate an
> > interface without seeing how that interface is going to be used in real
code. 
> > Is https://codereview.chromium.org/15203004 (diffed against version of trunk
> > before this started landing) or https://codereview.chromium.org/17507006
(diff
> > against current trunk) still authoritative if I want to look at that?
> 
> 15203004 is pretty close... it lacks some cleanup changes I made for this
> review, but I'll update it to match this code (I'm assuming that you just want
> to see the code, not to patch it locally).

It's easiest for me to navigate code in my usual development environment, so if
you can make it locally patchable, that'd be useful.  If it's a hassle, don't
worry about it.  (Note that it doesn't need to be at ToT, identical to this CL,
or compilable to work well for me--if I can sync to a particular revision,
patch, and have the patch succeed, I'm happy).  

Thanks for the summary below; I'll work on reviewing the CL based on it.

> 
> The use is basically:
> BackendImplV3 owns IndexTable, and initialization is made by:
> 
> int BackendImplV3::Init(const CompletionCallback& callback) {
>   scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_INIT);
>   PostWorkItem(work_item);  ---> post a task to the cache thread
(BackendWorker)
> to open and map the files.
> }
> 
> void BackendImplV3::OnInitComplete(WorkItem* work_item) {
>   // called when the BackendWorker completes initialization... so that
> IndexTableInitData is available on the IO thread
>   index_.Init(&result.get()->index_data);
> }
> 
> When the index needs to grow:
> void BackendImplV3::GrowIndex() {
>   scoped_refptr<WorkItem> work_item = new WorkItem(WorkItem::WORK_GROW_INDEX);
>   PostWorkItem(work_item);   ----> same thing, ask the worker to extend the
> files and map them again
> }
> 
> void BackendImplV3::OnGrowIndexComplete(WorkItem* work_item) {
>   index_.Init(&result.get()->index_data);   ----> replace the old mappings
with
> the new one
>   work_item->set_flags(WorkItem::WORK_COMPLETE);
>   PostWorkItem(work_item);   ----> post a task to release the old mappings
> }
> 
> To create an entry:
> int BackendImplV3::CreateEntry(const std::string& key, Entry** entry,
>                                const CompletionCallback& callback) {
>   EntrySet entries = index_.LookupEntry(hash);
>   + post a task to open the entry if a match is found
> }
> 
> int BackendImplV3::OnCreateEntryComplete(const std::string& key, uint32 hash,
>                                          ShortEntryRecord* short_record,
>                                          Entry** entry,
>                                          const CompletionCallback& callback) {
>   EntryCell cell = index_.CreateEntryCell(hash, entry_address);
>   if (!cell.IsValid()) {
>   }
> 
>   + use the cell.
> }
> 
> enumerations:
> (around BackendImplV3::OpenNext() )
> 
>    int lower_limit = index_.CalculateTimestamp(work_item->initial_time());
>     if (iterator->timestamp <= lower_limit ||
>         !index_.GetNextCells(iterator)) {
>       return false;
>     }
> 
> ...
>    while (!cells->empty()) {
>       EntryCell last_cell = index_.FindEntryCell(cells->back().hash,
>                                                  cells->back().address);
>       cells->pop_back();
>       if (!last_cell.IsValid())
>         continue;
> 
> Basically the IndexTable returns a group of entries that match a given
timestamp
> (that has a granularity of 1 minute)... keeps the list around and inspect each
> element to iterate or do evictions. The process implies sending tasks to the
> worker thread and when that completes verifying that the cell is ok, by
calling
> FindEntryCell
> 
> The other user of the IndexTable API is eviction_v3, which grabs the table
from
> the backend and uses it to select entries to evict... but then it just tells
the
> backend to evict a given entry. So basically, all of EvictionV3 +
BackendImplV3
> + this CL run on the IO thread, and all thread hoping is done by the backend.
> 
> Let me know if you are interested in the thread logic (BackendWorker and
> BackendWorkItem)... I assume it's not that relevant at this point. (that is to
> say, I'm happy to answer questions, the reference CL is not commented so it
may
> be hard to follow).

rvargas (doing something else)

> It's easiest for me to navigate code in my usual development environment, so if ...

7 years, 1 month ago (2013-11-01 22:26:06 UTC) #9

rvargas (doing something else)

On 2013/11/01 22:26:06, rvargas wrote: > > It's easiest for me to navigate code in ...

7 years, 1 month ago (2013-11-02 01:19:46 UTC) #10

Randy Smith (Not in Mondays)

Thanks! I'll try to get a first round back today.

7 years, 1 month ago (2013-11-04 16:04:13 UTC) #11

Randy Smith (Not in Mondays)

On 2013/11/04 16:04:13, rdsmith wrote: > Thanks! I'll try to get a first round back ...

7 years, 1 month ago (2013-11-04 22:32:33 UTC) #12

Randy Smith (Not in Mondays)

Quick question below (not blocking review, but I suspect you'll give me the answer faster ...

7 years, 1 month ago (2013-11-05 19:52:25 UTC) #13

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.h File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.h#newcode117 net/disk_cache/v3/index_table.h:117: // combination of hash and address. On 2013/11/05 19:52:26, ...

7 years, 1 month ago (2013-11-05 19:59:52 UTC) #14

Randy Smith (Not in Mondays)

Still working on wrapping my brain around the CL. One questions; What is the intention ...

7 years, 1 month ago (2013-11-05 21:58:21 UTC) #15

Still working on wrapping my brain around the CL.  

One questions; What is the intention for the difference in usage between
EntryCell and IndexCell?  I understand that IndexCell is a struct (so, among
other things, it can't really have methods without a lot of contorting) but I'm
wondering what gets in the way of making EntryCell the way that the IndexCell
information is used anywhere where we're not actually trying to write an
IndexCell in a bucket?  My goal here is bundling all of the functions at the
beginning of index_table.cc that manipulate IndexCells into member functions on
EntryCell and more rarely using IndexCells.  (This question is being asked more
to clarify my thinking around IndexCell/EntryCell than it is a serious
suggestion; I figure I'll learn something important from the answer.)

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:362: memcpy(destination, &cell_,
sizeof(cell_));
nit: Why not assignment?  It would seem simpler.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:385: if (main_table_) {
What's the difference between this case and |growing == true|?  The bits
comparison makes it look like we're confirming that we've just bumped up the
number of bits in the table by one, i.e. that we're growing.  And this function
is the only place we set either main_table_ or header_.  So it would seem that
this is only different from |growing == true| if params was first passed with a
null main_table_.  When can that happen/what does it mean?

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:390: old_extra_table.reset(new 
IndexBucket[extra_size]);
nit: double whitespace.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:800: int time = GetCellTimestamp(current_cell);
nit: This value doesn't seem like it's used?

rvargas (doing something else)

On 2013/11/05 21:58:21, rdsmith wrote: > Still working on wrapping my brain around the CL. ...

7 years, 1 month ago (2013-11-05 22:28:44 UTC) #16

rvargas (doing something else)

and the comments https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc#newcode362 net/disk_cache/v3/index_table.cc:362: memcpy(destination, &cell_, sizeof(cell_)); On 2013/11/05 21:58:21, ...

7 years, 1 month ago (2013-11-05 22:29:12 UTC) #17

and the comments

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:362: memcpy(destination, &cell_,
sizeof(cell_));
On 2013/11/05 21:58:21, rdsmith wrote:
> nit: Why not assignment?  It would seem simpler.

no reason

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:385: if (main_table_) {
On 2013/11/05 21:58:21, rdsmith wrote:
> What's the difference between this case and |growing == true|?  The bits
> comparison makes it look like we're confirming that we've just bumped up the
> number of bits in the table by one, i.e. that we're growing.  And this
function
> is the only place we set either main_table_ or header_.  So it would seem that
> this is only different from |growing == true| if params was first passed with
a
> null main_table_.  When can that happen/what does it mean?

There are two cases for increasing the size:
- We could be doubling the size of the table (which means changing the mask_
etc)
- We are not extending the main table, just adding more entries to the extra
table.

For example, we can have a 64k main table with maybe 8k entries on the extra
table (max 72k), and we just decided to add another 8k at the end (grow to 80k).
At some point, when we get close to 128k, we double the main table and reset to
a small extra table.

So, |growing| is true every time after the original initialization... but a
main_table will only be passed in when doubling it.

(I'll add a comment with this)

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:390: old_extra_table.reset(new 
IndexBucket[extra_size]);
On 2013/11/05 21:58:21, rdsmith wrote:
> nit: double whitespace.

Done.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:800: int time = GetCellTimestamp(current_cell);
On 2013/11/05 21:58:21, rdsmith wrote:
> nit: This value doesn't seem like it's used?

Done.

Randy Smith (Not in Mondays)

A couple more interface level questions. I'll start working on reviewing the implementation now. https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h ...

7 years, 1 month ago (2013-11-05 22:31:12 UTC) #18

A couple more interface level questions.  I'll start working on reviewing the
implementation now.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:120: struct IndexIterator {
Index iterators look to me to be used in a very specific way, which is to get a
group of cells all of which have the same timestamp, and when passed back in
again to the appropriate function, to get the oldest/youngest group of cells
from the current timestamp.  This seems a general enough concept to be worth
documenting; willing to throw in a comment discussing that here?

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
The existence of this function (and the fact that it's breaching the abstraction
boundary around index table timestamps) bothers me a little bit.  In an ideal
world, I'd think that either this class interface should only take base::Time
values, or we should define somewhere (maybe just in a comment) the minutes
since base value style of timestamps, and this class interface should only take
those values.

It looks to me from the target CL that the only place this function is (will
eventually be) used for iterator timestamp values, which suggests we could do
the abstraction cleanup by making the interface for IndexIterators be base::Time
(probably plus an IndexTable, possibly in the constructor).  

I'm not strongly in favor of that (there could very easily be complexities
having to do with threading and access to the base value that I'm not seeing),
but I wanted to raise the possibility in case it appealed to you.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:191: // Called each time the index should save
the backup information.
I presume this is part of the class interface because the backup of the index
needs to be synchronized with other backend backup activities.  If that's
accurate, willing to put in a comment to that effect?  (Something like "Exposed
so that index backup can be synchronized with the backup of the IndexBackend
that owns the index."?)

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h#newcode120 net/disk_cache/v3/index_table.h:120: struct IndexIterator { On 2013/11/05 22:31:13, rdsmith wrote: > ...

7 years, 1 month ago (2013-11-06 02:23:02 UTC) #19

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:120: struct IndexIterator {
On 2013/11/05 22:31:13, rdsmith wrote:
> Index iterators look to me to be used in a very specific way, which is to get
a
> group of cells all of which have the same timestamp, and when passed back in
> again to the appropriate function, to get the oldest/youngest group of cells
> from the current timestamp.  This seems a general enough concept to be worth
> documenting; willing to throw in a comment discussing that here?

Done.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
On 2013/11/05 22:31:13, rdsmith wrote:
> The existence of this function (and the fact that it's breaching the
abstraction
> boundary around index table timestamps) bothers me a little bit.  In an ideal
> world, I'd think that either this class interface should only take base::Time
> values, or we should define somewhere (maybe just in a comment) the minutes
> since base value style of timestamps, and this class interface should only
take
> those values.
> 
> It looks to me from the target CL that the only place this function is (will
> eventually be) used for iterator timestamp values, which suggests we could do
> the abstraction cleanup by making the interface for IndexIterators be
base::Time
> (probably plus an IndexTable, possibly in the constructor).  
> 
> I'm not strongly in favor of that (there could very easily be complexities
> having to do with threading and access to the base value that I'm not seeing),
> but I wanted to raise the possibility in case it appealed to you.

The issue is that I intentionally wanted to keep the iterators as dumb structs
(even though now they have hidden ctor/dtors for reasons beyond my comprehension
at this time). And the time on the iterator cannot be a high resolution
variable, it has to be a low resolution timestamp, otherwise time comparisons
will fail.

Modifying the iterators to provide comparison functions and keep a pointer back
to the table to do translations seem like a high burden to hide the fact that
the resolution is lower on the table... and having a Time interface may make the
caller assume that things have a higher resolution.

I don't completely follow the first suggestion. Do you mean a typedef of int to
minutes or to timestamp?

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:191: // Called each time the index should save
the backup information.
On 2013/11/05 22:31:13, rdsmith wrote:
> I presume this is part of the class interface because the backup of the index
> needs to be synchronized with other backend backup activities.  If that's
> accurate, willing to put in a comment to that effect?  (Something like
"Exposed
> so that index backup can be synchronized with the backup of the IndexBackend
> that owns the index."?)

Kind of. I can add more comments but I want to understand what you mean first. I
considered having a more direct command from the backend instead (DoFoo), but I
wanted to keep the essence of being a general purpose timer tick that can be
used by this class if it wants to (and for whatever it wants to)... and that
happens to be the same timer that other parts of the code use for other things
(which is _sort of_ required for things to work nicely).

... and the timer is provided from outside because it makes testing much easier
than just having an independent timer private to this class (plus seems more
wasteful).

So in terms of the need for synchronization that is just tangential... as in the
backend will delay operations for a while that should be enough for the index to
update the backup... so a number of choices would work for that.

So I don't know exactly how to frame the comment.

Randy Smith (Not in Mondays)

(Just responding to questions/comments; will continue with review.) https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h#newcode176 net/disk_cache/v3/index_table.h:176: int ...

7 years, 1 month ago (2013-11-06 18:16:14 UTC) #20

(Just responding to questions/comments; will continue with review.)

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
On 2013/11/06 02:23:03, rvargas wrote:
> The issue is that I intentionally wanted to keep the iterators as dumb structs
> (even though now they have hidden ctor/dtors for reasons beyond my
comprehension
> at this time). 

Why keep them as dumb structs?  I understand the motivation for the disk format
structs, but why the iterator?

> And the time on the iterator cannot be a high resolution
> variable, it has to be a low resolution timestamp, otherwise time comparisons
> will fail.

Yep, I get that.  That's not the same as converting from a high resolution value
on construction or initialization, though (though I think we're clear on this
from your comments below).

> Modifying the iterators to provide comparison functions and keep a pointer
back
> to the table to do translations seem like a high burden to hide the fact that
> the resolution is lower on the table... and having a Time interface may make
the
> caller assume that things have a higher resolution.

Yeah, that makes sense.  I will say that since an IndexIterator only makes sense
in the context of a specific IndexTable (because the minutes it stores are
relevant to that IndexTable), I think a pointer from iterator to table actually
wouldn't be an unreasonable thing.

To be clear, my goal was not to hide the fact that the resolution is lower on
the table--it's to make the interface consistent.  If everything on the index
table took ints that were minutes since some fixed time, I'd be happy.  It's the
fact that consumes of the IndexTable interface both use base::Time and have to
understand and use the minutes-since-header_->base_time concept (in manipulating
the iterators).  

> I don't completely follow the first suggestion. Do you mean a typedef of int
to
> minutes or to timestamp?

typedef of int to minutes would do it, or just a comment explaining that certain
arguments (maybe named index_minutes?  :-}) were minutes since
header_->base_time.  But that's not really ideal either, because
header_->base_time is a concept that's pretty far internal to the IndexTable. 
So this may not be worth it.  I just wanted to call out what I saw as an
abstraction breaking (that the minutes since base time concept escapes outside
the boundary of the IndexTable/IndexIterator interface) and explore whether
there was some way of fixing that.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:191: // Called each time the index should save
the backup information.
On 2013/11/06 02:23:03, rvargas wrote:
> On 2013/11/05 22:31:13, rdsmith wrote:
> > I presume this is part of the class interface because the backup of the
index
> > needs to be synchronized with other backend backup activities.  If that's
> > accurate, willing to put in a comment to that effect?  (Something like
> "Exposed
> > so that index backup can be synchronized with the backup of the IndexBackend
> > that owns the index."?)
> 
> Kind of. I can add more comments but I want to understand what you mean first.
I
> considered having a more direct command from the backend instead (DoFoo), but
I
> wanted to keep the essence of being a general purpose timer tick that can be
> used by this class if it wants to (and for whatever it wants to)... and that
> happens to be the same timer that other parts of the code use for other things
> (which is _sort of_ required for things to work nicely).
> 
> ... and the timer is provided from outside because it makes testing much
easier
> than just having an independent timer private to this class (plus seems more
> wasteful).
> 
> So in terms of the need for synchronization that is just tangential... as in
the
> backend will delay operations for a while that should be enough for the index
to
> update the backup... so a number of choices would work for that.
> 
> So I don't know exactly how to frame the comment.

"Interface exposed for testing, and in case backup of these data structures
should be synchronized with other backup operations."?  I'm mostly reacting to
the fact that when I read through the code I was initially confused as to why
this wasn't internal with its own timer--your reasons make sense to me, but I
think there's some use in making them explicit.

(I'll admit that hearing that the synchronization is "sort of" required for
things to work nicely makes me nervous; that's something I'd very much like to
have explained in its own comment in the code.  But I suspect that comment
doesn't belong here or in this CL.)

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h File net/disk_cache/v3/index_table.h (right): https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_table.h#newcode176 net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time); On 2013/11/06 18:16:15, rdsmith wrote: > ...

7 years, 1 month ago (2013-11-07 02:02:07 UTC) #21

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
On 2013/11/06 18:16:15, rdsmith wrote:
> On 2013/11/06 02:23:03, rvargas wrote:
> > The issue is that I intentionally wanted to keep the iterators as dumb
structs
> > (even though now they have hidden ctor/dtors for reasons beyond my
> comprehension
> > at this time). 
> 
> Why keep them as dumb structs?  I understand the motivation for the disk
format
> structs, but why the iterator?

It's just that I see this as a storage for iteration-related data, not really as
an object that is trying to encapsulate something. It's main purpose is to avoid
passing a bunch of separate arguments to all the functions that are involved
with enumerations.

> 
> > And the time on the iterator cannot be a high resolution
> > variable, it has to be a low resolution timestamp, otherwise time
comparisons
> > will fail.
> 
> Yep, I get that.  That's not the same as converting from a high resolution
value
> on construction or initialization, though (though I think we're clear on this
> from your comments below).
> 
> > Modifying the iterators to provide comparison functions and keep a pointer
> back
> > to the table to do translations seem like a high burden to hide the fact
that
> > the resolution is lower on the table... and having a Time interface may make
> the
> > caller assume that things have a higher resolution.
> 
> Yeah, that makes sense.  I will say that since an IndexIterator only makes
sense
> in the context of a specific IndexTable (because the minutes it stores are
> relevant to that IndexTable), I think a pointer from iterator to table
actually
> wouldn't be an unreasonable thing.

it's certainly not unreasonable :)

> 
> To be clear, my goal was not to hide the fact that the resolution is lower on
> the table--it's to make the interface consistent.  If everything on the index
> table took ints that were minutes since some fixed time, I'd be happy.  It's
the
> fact that consumes of the IndexTable interface both use base::Time and have to
> understand and use the minutes-since-header_->base_time concept (in
manipulating
> the iterators).  
> 
> > I don't completely follow the first suggestion. Do you mean a typedef of int
> to
> > minutes or to timestamp?
> 
> typedef of int to minutes would do it, or just a comment explaining that
certain
> arguments (maybe named index_minutes?  :-}) were minutes since
> header_->base_time.  But that's not really ideal either, because
> header_->base_time is a concept that's pretty far internal to the IndexTable. 
> So this may not be worth it.  I just wanted to call out what I saw as an
> abstraction breaking (that the minutes since base time concept escapes outside
> the boundary of the IndexTable/IndexIterator interface) and explore whether
> there was some way of fixing that.

I tried the typedef (Timestamp) but I didn't liked it!. I agree that I don't
want code outside of this class worrying about the actual resolution of the
timestamp... if it is 30 seconds, or 1 minute of whatever... and that's why I
don't want to make it part of the contract that it is actually 1 minute
(although there's a unit test for that)... but I still like the simple idea of
having the caller be aware of the existence of a low resolution time without
trying to hide that detail behind a full iterator interface. So I added a longer
comment.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:191: // Called each time the index should save
the backup information.
On 2013/11/06 18:16:15, rdsmith wrote:
> On 2013/11/06 02:23:03, rvargas wrote:
> > On 2013/11/05 22:31:13, rdsmith wrote:
> > > I presume this is part of the class interface because the backup of the
> index
> > > needs to be synchronized with other backend backup activities.  If that's
> > > accurate, willing to put in a comment to that effect?  (Something like
> > "Exposed
> > > so that index backup can be synchronized with the backup of the
IndexBackend
> > > that owns the index."?)
> > 
> > Kind of. I can add more comments but I want to understand what you mean
first.
> I
> > considered having a more direct command from the backend instead (DoFoo),
but
> I
> > wanted to keep the essence of being a general purpose timer tick that can be
> > used by this class if it wants to (and for whatever it wants to)... and that
> > happens to be the same timer that other parts of the code use for other
things
> > (which is _sort of_ required for things to work nicely).
> > 
> > ... and the timer is provided from outside because it makes testing much
> easier
> > than just having an independent timer private to this class (plus seems more
> > wasteful).
> > 
> > So in terms of the need for synchronization that is just tangential... as in
> the
> > backend will delay operations for a while that should be enough for the
index
> to
> > update the backup... so a number of choices would work for that.
> > 
> > So I don't know exactly how to frame the comment.
> 
> "Interface exposed for testing, and in case backup of these data structures
> should be synchronized with other backup operations."?  I'm mostly reacting to
> the fact that when I read through the code I was initially confused as to why
> this wasn't internal with its own timer--your reasons make sense to me, but I
> think there's some use in making them explicit.
> 
> (I'll admit that hearing that the synchronization is "sort of" required for
> things to work nicely makes me nervous; that's something I'd very much like to
> have explained in its own comment in the code.  But I suspect that comment
> doesn't belong here or in this CL.)

How about now?

Randy Smith (Not in Mondays)

Sorry; a lot of this review is still my climbing the cache learning curve, but ...

7 years, 1 month ago (2013-11-07 20:25:15 UTC) #22

Sorry; a lot of this review is still my climbing the cache learning curve, but I
*am* making progress :-}.   There are a couple of abstractions I want to dig
into further (I mention the overlap in code between GetOldest() and
GetNextCells() below, and I'd like to think about whether there's a way to
better encapsulate the small vs. large table distinction and deal with the
abstraction breakage mentioned in the initial comment, but I haven't quite
understood the breakage yet).  But here's another round of comments, hopefully
moving the process forward.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1,
header()->max_bucket, extra_table_,
It took me a bit of time to figure out why the iteration was being done this
way, rather than just hammering through all the buckets in the index.  So I
think there could usefully be a comment somewhere in this file outlining the
algorithm for iterating through the index, given that that algorithm is used in
several places in the file.  (I thought about suggesting we try to abstract that
algorithm in some fashion, but I couldn't come up with anything that didn't
involve function pointers/callbacks, which didn't strike me as worth it).

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash,
A quick comment describing what this does?  It's non-trivial (i.e. you, or at
least I, needed to read the code closely to understand it), it's not really
self-describing, and it's used as a primitive in another function.  Maybe "If
the passed cell is the same timestamp as |*list_time|, add it to the list.  If
it's an earlier timestamp, replace the list with it and update |*list_time|.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash,
Why is |bucket_hash| an argument to this function?  It isn't used.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:423: extra_table_ = params->extra_table;
Could you put a comment in as to why this is safe to do in the non-doubling
case, but we need to save and copy the cells in the non-doubling case?  (I
presume it's because in the non-doubling case we're memory mapping the same
section of file + some extra, so we don't lose anything, but if we're doubling
we want to copy the stuff that used to be extra into the main section, but I
think that's worth calling out.)

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:434: int num_words = (header_->table_len + 31)
/ 32;
table_len is in bits?  I wouldn't have naturally assumed that; might be worth a
comment somewhere.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time <
iterator->timestamp) ||
This next section of code looks to have a lot of overlap with
UpdateListFromCells; might it be worthwhile to add a directional argument to
that function and re-use it here?  Alternatively, would it make sense to change
UpdateListWithCell operate on iterators, and use iterators in the
GetOldest/GetOldestFromBucket logic?  That might simplify the calling signature
for GetOldestFromBucket because the iterators would combine the *_use and
*_use_time arguments.

(I'm feeling like there's a fair amount of overlap between the functionality of
GetOldest/GetNextCells and looking for ways to combine code--I'm also wondering
if there might be some way to combine GetNewestFromBucket() and
GetOldestFromBucket().  If you like one of the above suggestions I may have
other ideas after I see that code.)

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells
from the current timestamp.
This doesn't seem quite right to me / doesn't match the code as I read it.  As I
read the code, what's happening is that when you call GetNextCells, you note the
current timestamp, trash the current set of cells, and get all cells that are in
the next timestamp that has any cells in it in the direction you're going.  That
right?  If so, I find "from the current timestamp" confusing; it's really from
the next oldest/youngest timestamp.  

It also looks to me as if currently, GetNextCells() is the only function that
takes/uses iterators.  If that's true and is likely to stay true for a while,
referring directly to that function would make sense to me. 

I also have a slight tug towards making this a nested class for IndexTable,
since it's only ever used with IndexTable.  I remember that nested classes are
discouraged, but my memory as to why is that often their definitions aren't
needed in the header files, and this definition is in the header file anyway. 
But it might be worth avoiding because it's not a common pattern, so this is
just a suggestion.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:130: int timestamp;  // The current low
resolution timestamp for |cells|.
Worth having a reference to IndexTable::CalculateTimestamp?

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:178: EntrySet LookupEntry(uint32 hash);
nit, suggestion: This is a bit dependent on the definition of "Entry", but to
me, it's looking up multiple entries (all entries that share the same hash).  So
based on my current understanding of the abstractions, I think it would be
better as LookupEntries().  WDYT?

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:189: // of cells will share the same timestamp.
The comment looks good to me.  Is it worthwhile having a reference to
IndexIterator here, since that's the only place that the consumer would use the
low resolution timestamp?

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator);
I'm bothered by the parallel names here for two functions that do fairly
different things.  Maybe rename GetNewestFromBucket to UpdateIteratorFromBucket?

rvargas (doing something else)

Thanks. https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc#newcode674 net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, On ...

7 years, 1 month ago (2013-11-08 04:22:29 UTC) #23

Thanks.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1,
header()->max_bucket, extra_table_,
On 2013/11/07 20:25:15, rdsmith wrote:
> It took me a bit of time to figure out why the iteration was being done this
> way, rather than just hammering through all the buckets in the index.  So I

What do you mean by hammering through all the buckets? I think we are hammering
through all the buckets :). It's just that the structure of the table implies
that to locate something on the extra table one has to start on a given bucket
on the main table. So we could ignore the extra buckets when looking at the main
table, but we really cannot ignore the main table when looking at the extra
table. Is that what you mean?

> think there could usefully be a comment somewhere in this file outlining the
> algorithm for iterating through the index, given that that algorithm is used
in
> several places in the file.  (I thought about suggesting we try to abstract
that
> algorithm in some fashion, but I couldn't come up with anything that didn't
> involve function pointers/callbacks, which didn't strike me as worth it).  

I think I went through that same phase at some point, but the conclusion was the
same: attempting to generalize the code was not worth it / would just complicate
things.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash,
On 2013/11/07 20:25:15, rdsmith wrote:
> A quick comment describing what this does?  It's non-trivial (i.e. you, or at
> least I, needed to read the code closely to understand it), it's not really
> self-describing, and it's used as a primitive in another function.  Maybe "If
> the passed cell is the same timestamp as |*list_time|, add it to the list.  If
> it's an earlier timestamp, replace the list with it and update |*list_time|.

will do

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:208: void UpdateListWithCell(int bucket_hash,
On 2013/11/07 20:25:15, rdsmith wrote:
> Why is |bucket_hash| an argument to this function?  It isn't used.

Sorry about that... it's just the result of refactoring the code.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:423: extra_table_ = params->extra_table;
On 2013/11/07 20:25:15, rdsmith wrote:
> Could you put a comment in as to why this is safe to do in the non-doubling
> case, but we need to save and copy the cells in the non-doubling case?  (I
> presume it's because in the non-doubling case we're memory mapping the same
> section of file + some extra, so we don't lose anything, but if we're doubling
> we want to copy the stuff that used to be extra into the main section, but I
> think that's worth calling out.)

Will do, and you're correct.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:434: int num_words = (header_->table_len + 31)
/ 32;
On 2013/11/07 20:25:15, rdsmith wrote:
> table_len is in bits?  I wouldn't have naturally assumed that; might be worth
a
> comment somewhere.

table_len is in cells (aka, buckets * 4). But we need to store a bitmap, which
has one bit per cell, so this is how many 32-bit words that bitmap needs.

I'll add a comment.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time <
iterator->timestamp) ||
On 2013/11/07 20:25:15, rdsmith wrote:
> This next section of code looks to have a lot of overlap with
> UpdateListFromCells; might it be worthwhile to add a directional argument to
> that function and re-use it here?  Alternatively, would it make sense to
change
> UpdateListWithCell operate on iterators, and use iterators in the
> GetOldest/GetOldestFromBucket logic?  That might simplify the calling
signature
> for GetOldestFromBucket because the iterators would combine the *_use and
> *_use_time arguments.
> 
> (I'm feeling like there's a fair amount of overlap between the functionality
of
> GetOldest/GetNextCells and looking for ways to combine code--I'm also
wondering
> if there might be some way to combine GetNewestFromBucket() and
> GetOldestFromBucket().  If you like one of the above suggestions I may have
> other ideas after I see that code.)

I'll look at it again. This was the main part where I mentioned that I
considered ways to extract code into a single implementation method but I
couldn't come up with anything that was simpler (worth it)... as in at least
this way is easier to understand.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells
from the current timestamp.
On 2013/11/07 20:25:15, rdsmith wrote:
> This doesn't seem quite right to me / doesn't match the code as I read it.  As
I
> read the code, what's happening is that when you call GetNextCells, you note
the
> current timestamp, trash the current set of cells, and get all cells that are
in
> the next timestamp that has any cells in it in the direction you're going. 
That
> right?  If so, I find "from the current timestamp" confusing; it's really from
> the next oldest/youngest timestamp.

I'll clarify the comment. It doesn't mean that the caller sets the timestamp and
the callee returns cells that match it... it means that when the iterator is
returned, the cells in the iterator share the timestamp (as it is set on
return). So GetNext(from 0) will returns cells with 1, or 2 or n as timestamp,
not 0.

I'll also link the comment with the API.

As for a nested class, I don't mind it either way, but I know the guide applies
even to Delegates, which are tightly tied to the class that declares it and are
always going to be part of the same header... and we have lots of nested
delegates, but AFAIK we want to go in the other direction. But given that I
don't mind either way I'll leave it up to you.

> 
> It also looks to me as if currently, GetNextCells() is the only function that
> takes/uses iterators.  If that's true and is likely to stay true for a while,
> referring directly to that function would make sense to me. 
> 
> I also have a slight tug towards making this a nested class for IndexTable,
> since it's only ever used with IndexTable.  I remember that nested classes are
> discouraged, but my memory as to why is that often their definitions aren't
> needed in the header files, and this definition is in the header file anyway. 
> But it might be worth avoiding because it's not a common pattern, so this is
> just a suggestion.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:178: EntrySet LookupEntry(uint32 hash);
On 2013/11/07 20:25:15, rdsmith wrote:
> nit, suggestion: This is a bit dependent on the definition of "Entry", but to
> me, it's looking up multiple entries (all entries that share the same hash). 
So
> based on my current understanding of the abstractions, I think it would be
> better as LookupEntries().  WDYT?

sure

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index_table.cc#newcode874 net/disk_cache/v3/index_table.cc:874: if ((iterator->forward && time < iterator->timestamp) || On 2013/11/08 ...

7 years, 1 month ago (2013-11-08 23:58:18 UTC) #24

Randy Smith (Not in Mondays)

Still reviewing, but this is a bundle of comments all about iteration through the table, ...

7 years, 1 month ago (2013-11-11 20:54:53 UTC) #25

Still reviewing, but this is a bundle of comments all about iteration through
the table, and so I figure I can share these and we can work in parallel.

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1,
header()->max_bucket, extra_table_,
On 2013/11/08 04:22:30, rvargas wrote:
> On 2013/11/07 20:25:15, rdsmith wrote:
> > It took me a bit of time to figure out why the iteration was being done this
> > way, rather than just hammering through all the buckets in the index.  So I
> 
> What do you mean by hammering through all the buckets? 

Well, what I meant was |for (int i = 0; i < header()->max_bucket(); i++))| and
removing the call to GetNextBucket().  But I think that was based on an
incorrect understanding of the layout (i.e. that everything was contiguous in
main_table_; I hadn't put together the passing of extra_table_ with how
GetNextBucket() works).  Ooops.  Ok, let me reflect back what I now think is
going on:
* main_table_ points to a table of size mask_ + 1.  That grows by doubling only.
* extra_table_ points to growth in the table when we don't need enough extra
entries to double the main table.  This is "easier" growth, since it just means
extending the extra_table file, and everything else (other than the size of that
file/table) stays unaltered.  There's a bit of a change when we switch pointers
from the old mapping of the extra table file to the new mapping, but because
they map the same file, it's fine for them to both be active for a bit.
* All bucket chains start in the main table, but all buckets except the first in
any chain will be in the extra table.

> I think we are hammering
> through all the buckets :). It's just that the structure of the table implies
> that to locate something on the extra table one has to start on a given bucket
> on the main table. So we could ignore the extra buckets when looking at the
main
> table, but we really cannot ignore the main table when looking at the extra
> table. Is that what you mean?

Based on my new understanding I'll modify my thought (not even a suggestion;
just the alternative algorithm I went through in getting to my current
understanding).  Could we write the above loop as?:

for (int i = 0; i < static_cast<int32>(mask_ + 1); i++) 
  GetNewestFromBucket(&main_table_[i], i, current_time, iterator);

for (int i = static_cast<int32>(mask_ + 1); i < header()->max_bucket; i++) {
  IndexBucket* bucket = &extra_table_[i - static_cast<int32>(mask_ + 1)];
  GetNewestFromBucket(bucket, bucket->hash, current_time, iterator);
}

??  Not saying we should, just trying to make sure I understand the data
structure format.

> 
> > think there could usefully be a comment somewhere in this file outlining the
> > algorithm for iterating through the index, given that that algorithm is used
> in
> > several places in the file.  (I thought about suggesting we try to abstract
> that
> > algorithm in some fashion, but I couldn't come up with anything that didn't
> > involve function pointers/callbacks, which didn't strike me as worth it).  
> 
> I think I went through that same phase at some point, but the conclusion was
the
> same: attempting to generalize the code was not worth it / would just
complicate
> things.

Makes sense.  However, if I'm struggling as much as I am with getting solidly on
top of the iteration method and it's rationale, I'd like to have a comment
somewhere that sketches those two things out.  Is there such a comment?  If not,
would you be willing to put one in?

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
On 2013/11/07 02:02:07, rvargas wrote:
> On 2013/11/06 18:16:15, rdsmith wrote:
> > On 2013/11/06 02:23:03, rvargas wrote:
> > > The issue is that I intentionally wanted to keep the iterators as dumb
> structs
> > > (even though now they have hidden ctor/dtors for reasons beyond my
> > comprehension
> > > at this time). 
> > 
> > Why keep them as dumb structs?  I understand the motivation for the disk
> format
> > structs, but why the iterator?
> 
> It's just that I see this as a storage for iteration-related data, not really
as
> an object that is trying to encapsulate something. It's main purpose is to
avoid
> passing a bunch of separate arguments to all the functions that are involved
> with enumerations.

Hmmm.  Makes sense.  I think part of why I pull in the other direction is just
because I find it useful to conceptually bundle groups of related functions
together.  Specifically InitIterator and UpdateIterator are very much "work with
the iterator data structure", so I feel a pull to declare them as methods of
iterator.  (InitIterator() could be the Iterator constructor with just a bit of
tweaking).  I could imagine those being fine as methods on a struct, but that's
a slippery slope to full class status.  So it's your call.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator);
On 2013/11/07 20:25:15, rdsmith wrote:
> I'm bothered by the parallel names here for two functions that do fairly
> different things.  Maybe rename GetNewestFromBucket to
UpdateIteratorFromBucket?

I still feel this way; what are your thoughts?

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:904: }
I like this refactor.  I'll note that it could be taken one step further, though
I'm not certain without looking at the code whether that would be a good thing
or not, so I'll just offer it as an idea for you to consider.  The key function
would be one that:

* Took as arguments:
** no_use, low_use, and high_use iterators
** Pointers to counters for those three cases + evicted entries.
** A limit time.
* Cycled through all buckets and all cells.
* Updated iterators according to the use type of the cell and the limit time.

That would be called by both GetOldest and GetNextCells; GetNextCells() would
pass in the same iterator for all three cases, and null pointers for the
counting, and GetOldest() would reverse that.  To me the major con would be that
the interface for the function would be pretty big (eight arguments).  Pros
would be collapsing the above two functions and simplifying GetOldest() and
GetNextCells(), and making GetOldestFromBucket() not modify IndexTable (it looks
like a getter; the mutating is a little surprising).  

I could imagine expanding iterator to keep a count of cells cycled through it,
which would get rid of three arguments, and possibly turning the iterator
argument into an array.  But as I say, that all may or may not be worth it, so
I'll leave that up to you.

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:124: // the oldest / youngest group of cells
from the current timestamp. When this
nit, suggestion: "the oldest / youngest" -> "the next older / younger"?

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:127: // timestamp for them.
I'm still finding this comment a bit confusing, both in the references to
functions above ("appropriate function" vs. "GetNextCells") and in the
description of what IndexIterator does and how its used.  The only interfaces
that this can be used with by the consumer are GetNextCells() and GetOldest(),
so if it's consumer targeted we should either refer to both those functions or
make an abstract reference to what they each do.  So I think I'd drop back a
couple of yards, and describe what the iterator is/can be used for in the
context of those functions.  How about something like:

"An index iterator holds a group of cells, all of which have the same timestamp.
 It can be used to iterate through the cells in an IndexTable  in a given
direction (forward or backward in time) via calls to GetNextCells(), returning
on each call all the cells with the same timestamp."

(See below as to why I'm leaving out GetOldest().)

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:133: int timestamp;  // The current low
resolution timestamp for |cells|.
Did you have any reaction to my suggestion of cross references to/from
CalculateTimestamp below?

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:209: IndexIterator* high_use);
The implementation of this function initializes all the iterators and set the
forward variable, which I believe means that the caller really doesn't care
about the |forward| and |timestamp| portions of the IndexIterator.  Looking at 
https://codereview.chromium.org/17507006, it looks like the interface to this
function used to be CellLists.  Why isn't it in this CL?  That would seem to
capture the information being passed back to the caller more precisely.

rvargas (doing something else)

Thanks https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_table.cc#newcode674 net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1, header()->max_bucket, extra_table_, On ...

7 years, 1 month ago (2013-11-12 00:24:03 UTC) #26

Thanks

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/190001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.cc:674: bucket_id = GetNextBucket(mask_ + 1,
header()->max_bucket, extra_table_,
On 2013/11/11 20:54:54, rdsmith wrote:
> On 2013/11/08 04:22:30, rvargas wrote:
> > On 2013/11/07 20:25:15, rdsmith wrote:
> > > It took me a bit of time to figure out why the iteration was being done
this
> > > way, rather than just hammering through all the buckets in the index.  So
I
> > 
> > What do you mean by hammering through all the buckets? 
> 
> Well, what I meant was |for (int i = 0; i < header()->max_bucket(); i++))| and
> removing the call to GetNextBucket().  But I think that was based on an
> incorrect understanding of the layout (i.e. that everything was contiguous in
> main_table_; I hadn't put together the passing of extra_table_ with how
> GetNextBucket() works).  Ooops.  Ok, let me reflect back what I now think is
> going on:
> * main_table_ points to a table of size mask_ + 1.  That grows by doubling
only.
> * extra_table_ points to growth in the table when we don't need enough extra
> entries to double the main table.  This is "easier" growth, since it just
means
> extending the extra_table file, and everything else (other than the size of
that
> file/table) stays unaltered.  There's a bit of a change when we switch
pointers
> from the old mapping of the extra table file to the new mapping, but because
> they map the same file, it's fine for them to both be active for a bit.
> * All bucket chains start in the main table, but all buckets except the first
in
> any chain will be in the extra table.

Correct

> 
> > I think we are hammering
> > through all the buckets :). It's just that the structure of the table
implies
> > that to locate something on the extra table one has to start on a given
bucket
> > on the main table. So we could ignore the extra buckets when looking at the
> main
> > table, but we really cannot ignore the main table when looking at the extra
> > table. Is that what you mean?
> 
> Based on my new understanding I'll modify my thought (not even a suggestion;
> just the alternative algorithm I went through in getting to my current
> understanding).  Could we write the above loop as?:
> 
> for (int i = 0; i < static_cast<int32>(mask_ + 1); i++) 
>   GetNewestFromBucket(&main_table_[i], i, current_time, iterator);
> 
> for (int i = static_cast<int32>(mask_ + 1); i < header()->max_bucket; i++) {
>   IndexBucket* bucket = &extra_table_[i - static_cast<int32>(mask_ + 1)];
>   GetNewestFromBucket(bucket, bucket->hash, current_time, iterator);
> }
> 
> ??  Not saying we should, just trying to make sure I understand the data
> structure format.
> 

Correct

> > 
> > > think there could usefully be a comment somewhere in this file outlining
the
> > > algorithm for iterating through the index, given that that algorithm is
used
> > in
> > > several places in the file.  (I thought about suggesting we try to
abstract
> > that
> > > algorithm in some fashion, but I couldn't come up with anything that
didn't
> > > involve function pointers/callbacks, which didn't strike me as worth it). 

> > 
> > I think I went through that same phase at some point, but the conclusion was
> the
> > same: attempting to generalize the code was not worth it / would just
> complicate
> > things.
> 
> Makes sense.  However, if I'm struggling as much as I am with getting solidly
on
> top of the iteration method and it's rationale, I'd like to have a comment
> somewhere that sketches those two things out.  Is there such a comment?  If
not,
> would you be willing to put one in?

Will do.

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/520001/net/disk_cache/v3/index_...
net/disk_cache/v3/index_table.h:176: int CalculateTimestamp(base::Time time);
On 2013/11/11 20:54:54, rdsmith wrote:
> On 2013/11/07 02:02:07, rvargas wrote:
> > On 2013/11/06 18:16:15, rdsmith wrote:
> > > On 2013/11/06 02:23:03, rvargas wrote:
> > > > The issue is that I intentionally wanted to keep the iterators as dumb
> > structs
> > > > (even though now they have hidden ctor/dtors for reasons beyond my
> > > comprehension
> > > > at this time). 
> > > 
> > > Why keep them as dumb structs?  I understand the motivation for the disk
> > format
> > > structs, but why the iterator?
> > 
> > It's just that I see this as a storage for iteration-related data, not
really
> as
> > an object that is trying to encapsulate something. It's main purpose is to
> avoid
> > passing a bunch of separate arguments to all the functions that are involved
> > with enumerations.
> 
> Hmmm.  Makes sense.  I think part of why I pull in the other direction is just
> because I find it useful to conceptually bundle groups of related functions
> together.  Specifically InitIterator and UpdateIterator are very much "work
with
> the iterator data structure", so I feel a pull to declare them as methods of
> iterator.  (InitIterator() could be the Iterator constructor with just a bit
of
> tweaking).  I could imagine those being fine as methods on a struct, but
that's
> a slippery slope to full class status.  So it's your call.

I thought you were going to say something like that :)

The thing is that both methods would be private to the iterator, because what
they do is an implementation detail of the code that returns iterators... is not
something that the reader of the header (user of the class) should worry
about... or ever directly use. That leads directly to making the struct a class,
with a friend, and for the caller to benefit it would now need actual public
methods thrown into the mix.

I think I still prefer the trivial, fully open view for the caller.

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1070001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:224: int limit_time, IndexIterator* iterator);
On 2013/11/11 20:54:54, rdsmith wrote:
> On 2013/11/07 20:25:15, rdsmith wrote:
> > I'm bothered by the parallel names here for two functions that do fairly
> > different things.  Maybe rename GetNewestFromBucket to
> UpdateIteratorFromBucket?
> 
> I still feel this way; what are your thoughts?

Sorry, I missed this after the last refactor.

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:904: }
On 2013/11/11 20:54:54, rdsmith wrote:
> I like this refactor.  I'll note that it could be taken one step further,
though
> I'm not certain without looking at the code whether that would be a good thing
> or not, so I'll just offer it as an idea for you to consider.  The key
function
> would be one that:
> 
> * Took as arguments:
> ** no_use, low_use, and high_use iterators
> ** Pointers to counters for those three cases + evicted entries.
> ** A limit time.
> * Cycled through all buckets and all cells.
> * Updated iterators according to the use type of the cell and the limit time.
> 
> That would be called by both GetOldest and GetNextCells; GetNextCells() would
> pass in the same iterator for all three cases, and null pointers for the
> counting, and GetOldest() would reverse that.  To me the major con would be
that
> the interface for the function would be pretty big (eight arguments).  Pros
> would be collapsing the above two functions and simplifying GetOldest() and
> GetNextCells(), and making GetOldestFromBucket() not modify IndexTable (it
looks
> like a getter; the mutating is a little surprising).  
> 
> I could imagine expanding iterator to keep a count of cells cycled through it,
> which would get rid of three arguments, and possibly turning the iterator
> argument into an array.  But as I say, that all may or may not be worth it, so
> I'll leave that up to you.

My first reaction was not to grow a do-all method. However, one method was
basically a subset of the other one, so merged.

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:127: // timestamp for them.
On 2013/11/11 20:54:54, rdsmith wrote:
> I'm still finding this comment a bit confusing, both in the references to
> functions above ("appropriate function" vs. "GetNextCells") and in the
> description of what IndexIterator does and how its used.  The only interfaces
> that this can be used with by the consumer are GetNextCells() and GetOldest(),
> so if it's consumer targeted we should either refer to both those functions or
> make an abstract reference to what they each do.  So I think I'd drop back a
> couple of yards, and describe what the iterator is/can be used for in the
> context of those functions.  How about something like:
> 
> "An index iterator holds a group of cells, all of which have the same
timestamp.
>  It can be used to iterate through the cells in an IndexTable  in a given
> direction (forward or backward in time) via calls to GetNextCells(), returning
> on each call all the cells with the same timestamp."
> 
> (See below as to why I'm leaving out GetOldest().)

Another attempt :)

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:133: int timestamp;  // The current low
resolution timestamp for |cells|.
On 2013/11/11 20:54:54, rdsmith wrote:
> Did you have any reaction to my suggestion of cross references to/from
> CalculateTimestamp below?

I don't particularly want to cross reference individual members... and the
structure itself is already cross referenced with the methods that return it; I
added some reference to the timestamp on the description of GetNextCells. I'm
adding an explicit reference to the iterator from CalculateTimestamp.

Note that I'm not really trying to hide the existence of a table/cell timestamp,
so both the table and the cell have methods that return a timestamp.

https://codereview.chromium.org/53313004/diff/1330001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:209: IndexIterator* high_use);
On 2013/11/11 20:54:54, rdsmith wrote:
> The implementation of this function initializes all the iterators and set the
> forward variable, which I believe means that the caller really doesn't care
> about the |forward| and |timestamp| portions of the IndexIterator.  Looking at

> https://codereview.chromium.org/17507006, it looks like the interface to this
> function used to be CellLists.  Why isn't it in this CL?  That would seem to
> capture the information being passed back to the caller more precisely.

Up to patch set 4, this method returned just the cells. However, unifying what
to do with a cell when doing enumerations meant that the signature of this
method had to change because otherwise the whole returned lists would have to be
copied just before returning (an iterator cannot keep a reference to a list, it
is the owner of the list). OK, I could swap the lists at that time... but the
caller is keeping the timestamps anyway because they are needed to select what
to delete, so the only thing that is really not used is the direction... or to
view it from the other side, this method returns three iterators, it just
ignores their initial values. I believe the reason this method was using lists
directly is that I probably wrote it before creating the iterators.

Randy Smith (Not in Mondays)

The iteration stuff all looks good to me (with the nits below around the comment ...

7 years, 1 month ago (2013-11-13 21:14:12 UTC) #27

The iteration stuff all looks good to me (with the nits below around the comment
on GetNextCells()).  

I'm sorry this is taking so long, but I really do think we're making progress. 
That progress may be primarily measured in sections of the cache that I'm
successfully wrapping my brain around, but that's a progress measure :-}.  I
think once I successfully wrap my brain around the address formats and the small
vs. large table distinction I'll be in a position to do one large push and
finish up the review.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
I spent some time wrapping my brain around this function trying to understand
the format difference between the small table and the large table.  From
disk_format_v3.h the only difference is the number of bits allocated to the hash
and the address (small->large hash/addr: 16/16 -> 10/22).  But looking at this
function, it looks like we're also shifting from  situation where we aren't
using separate files (i.e. everything is always in either the entries file or
the evicted entries file) to one in which the file is encoded in the address
from the IndexCell.  This right?

If so, there looks to be a subtle contradiction between two different format
descriptions.  addr.h indicates taht it's reserving 8 bits in a CacheAddr for
the file selector.  But the specification in disk_format_v3.h implies that we're
only using six of those eight bits (if the bottom 16 bits are for the block,
that only leaves six bits for the file).  Is that right?

So a general concern (sorta implied by how long it took me to puzzle out the
above) is around the documentation of the different forms of address and their
relationship to each other, and the documentation of the difference in file
usage for small/large table size.  Is this done somewhere and I missed it?  If
not, could it be done somewhere?

(After writing this, I went back and read the comment at the top of the file,
which I hadn't understood on my first pass through, but now that I've re-read it
having better grokked the file formats and small/large table distinction I
understand better.  But I still feels like it should be easier to really grok
the file formats from reading through addr.h, disk_format_base.h, and
disk_format_v3.h, though that may be just my failure in stumbling through that
documentation in the wrong order.)

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
Another random table format question: How do you get an Addr() with number of
contiguous blocks != 1?  If you have such a thing, how do you store it in the
bits allocated for the address in an IndexCell?

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:262: return
Addr::FromEntryAddress(address_value);
For both these calls, it feels like "From*Address" is a bit of a misnomer, both
because "address" means several different things in this code, and because the
address_value means different things in the small and large table formats. 
Maybe From*FileIndexAddress?

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr
address);
I tried to grok the different usages of CreateEntryCell() and FindEntryCell() by
looking at the reference v3 CL, but I couldn't get it.  They look different
enough that my initial thought of "Would a single FindOrCreate interface work?"
almost certainly wasn't a good one.  But I'm not certain I understand why; could
you give me a sense as to how these will usually be used?  Specifically, it
seems like if you're doing a CreateEntryCell(), you've already done the hard
work if you've figured out the address, so it sounds like CreateEntryCell should
only be called from code that actually does allocation within the cache (which
is the backend??).  FindEntryCell you can do when you have the address, but
where are you likely to have gotten the address from, if not the cache?

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
Where is the boundary between low usage and high usage defined?

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:211: // that this method does not return the
cells matching the provided timestamp,
nit(s): I'd leave out the "keep in mind" since I don't think that information is
actually given anywhere else, change "the provided timestamp" to "the iterator's
timestamp", and change "(or before)" to "(or before, depending on the value of
the iterator's forward member)".  With that, I think this documentation + the
iterator documentation is good; are those changes ok with you?

rvargas (doing something else)

Thanks https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc#newcode250 net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/13 21:14:12, rdsmith ...

7 years, 1 month ago (2013-11-14 02:54:19 UTC) #28

Thanks

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/13 21:14:12, rdsmith wrote:
> I spent some time wrapping my brain around this function trying to understand
> the format difference between the small table and the large table.  From
> disk_format_v3.h the only difference is the number of bits allocated to the
hash
> and the address (small->large hash/addr: 16/16 -> 10/22).  But looking at this
> function, it looks like we're also shifting from  situation where we aren't
> using separate files (i.e. everything is always in either the entries file or
> the evicted entries file) to one in which the file is encoded in the address
> from the IndexCell.  This right?

The issue is that the small table format only supports up to 64k entries. 64k
entries is also roughly the limit of entries that can be stored by a single
block file, and given that other parts of the design assign a specific file type
for entries, the net result is that the block file type and number are fixed for
the small table case. It is a case of implied bits that don't have to be stored.

Large tables of course support more than 64k entries, and that means that the
file number is also part of what changes in the address, and has to be stored.

Now, ideally, an Address should be transparent for this code... and from the
point of view of the index format it shouldn't really matter what specific bits
of that address mean. On the other hand, the Address itself shouldn't care if it
is being stored by an index or not. In other words, either part of the
definition of Address leaks here, or details of the index leak into Address.

So what this code says is:
- for small tables, _I_ (this method) know that some bits here are fixed, and
the Address API gives me a clear way to set them as needed. 
- for large tables, _I_ know how many bits I have on the table, and I know that
they are not enough bits to overflow the "control" part of an Address, but I'll
let the Address create the right object though a dedicated method because the
normal API doesn't have what I need.

We could add more methods to Address to make it aware of small tables (or add
more arguments to the new methods), but between the two files I think it is
better to leak from Address to Index than the other way. I could use
Addr::FromEvictedAddress even for small tables after massaging the value, but I
don't think that's what you want.

> 
> If so, there looks to be a subtle contradiction between two different format
> descriptions.  addr.h indicates that it's reserving 8 bits in a CacheAddr for
> the file selector.  But the specification in disk_format_v3.h implies that
we're
> only using six of those eight bits (if the bottom 16 bits are for the block,
> that only leaves six bits for the file).  Is that right?

Sort of. The low order 24 bits of an address are "pure" addressing bits in that
they represent the addressing capacity of the format. The upper bits are some
sort of control, but an address cannot enumerate more than 2^24 entities.

BlockFiles is the one who really cares about the distinction between the low
order 16 bits and the file number.

So, overall, this code is saying that it can store up to 2^22 entries (active +
deleted) so the high order bits of the file number are always going to be 0.

> 
> So a general concern (sorta implied by how long it took me to puzzle out the
> above) is around the documentation of the different forms of address and their
> relationship to each other, and the documentation of the difference in file
> usage for small/large table size.  Is this done somewhere and I missed it?  If
> not, could it be done somewhere?

Sorry about that. There is some text in disk_format_v3.h about what an index
address is and how that is translated to Addr, but I expanded that comment with
an example, and also extended the description on addr.h

> 
> (After writing this, I went back and read the comment at the top of the file,
> which I hadn't understood on my first pass through, but now that I've re-read
it
> having better grokked the file formats and small/large table distinction I
> understand better.  But I still feels like it should be easier to really grok
> the file formats from reading through addr.h, disk_format_base.h, and
> disk_format_v3.h, though that may be just my failure in stumbling through that
> documentation in the wrong order.)

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/13 21:14:12, rdsmith wrote:
> Another random table format question: How do you get an Addr() with number of
> contiguous blocks != 1?  If you have such a thing, how do you store it in the
> bits allocated for the address in an IndexCell?

That case corresponds for example to storing 800 bytes of data somewhere. In
that case, block size is 256 and there are 4 blocks used for the record.

The index table only stores records of active or deleted entries, and they have
a fixed size, so they will always use just one block.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:262: return
Addr::FromEntryAddress(address_value);
On 2013/11/13 21:14:12, rdsmith wrote:
> For both these calls, it feels like "From*Address" is a bit of a misnomer,
both
> because "address" means several different things in this code, and because the
> address_value means different things in the small and large table formats. 
> Maybe From*FileIndexAddress?

The name follows what is available on addr.h :( I can change it if you feel
strongly about it, but the only things Addr knows about is BLOCK_ENTRIES and
BLOCK_EVICTED, it knows nothing about the index. (of course, that is not
completely true, as it knows what it means to grab value and put it into a
meaningless Addr)

FromEntryFileAddress could work, but I don't like that it's not clear what
"file" is doing there... as in that word sticks with Entry (correct), but also
with Address (wrong)

Another option would be to not use "address" for the index, but I could not come
up with another term

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr
address);
On 2013/11/13 21:14:12, rdsmith wrote:
> I tried to grok the different usages of CreateEntryCell() and FindEntryCell()
by
> looking at the reference v3 CL, but I couldn't get it.  They look different
> enough that my initial thought of "Would a single FindOrCreate interface
work?"
> almost certainly wasn't a good one.  But I'm not certain I understand why;
could
> you give me a sense as to how these will usually be used?  Specifically, it
> seems like if you're doing a CreateEntryCell(), you've already done the hard
> work if you've figured out the address, so it sounds like CreateEntryCell
should
> only be called from code that actually does allocation within the cache (which
> is the backend??).  FindEntryCell you can do when you have the address, but
> where are you likely to have gotten the address from, if not the cache?

There should be only two callers for CreateEntryCell and that is the backend
creating a new entry (after making sure that the entry doesn't exist), and
creating an evicted entry record, when evicting a normal entry.

On the other hand, FindEntryCell is used to make sure that something that was
valid some time ago is still valid right now (this is explained with the
declaration of EntryCell).

Basically, after calling LookupEntries you end up with a bunch of records that
exist right now (hopefully just one), but the caller now has to go open that
file and read it, and that takes time. By the time that operation completes, the
caller should not use the same EntryCell right away, or at least not write to
it, because it may have moved. So it has to grab the hash and address and do a
quick look up to get the current location of the cell, and then it can write to
it.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
On 2013/11/13 21:14:12, rdsmith wrote:
> Where is the boundary between low usage and high usage defined?

That's an implementation detail of the eviction algorithm (eviction_v3.cc)

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:211: // that this method does not return the
cells matching the provided timestamp,
On 2013/11/13 21:14:12, rdsmith wrote:
> nit(s): I'd leave out the "keep in mind" since I don't think that information
is

I meant it to be "as implied by the name" (otherwise it would be GetCurrentCells
or something like that) :) dropped.

> actually given anywhere else, change "the provided timestamp" to "the
iterator's
> timestamp", and change "(or before)" to "(or before, depending on the value of

changed the wording slightly because I think it is important to keep the
distinction between the value when it is called and on return.

> the iterator's forward member)".  With that, I think this documentation + the
> iterator documentation is good; are those changes ok with you?

Randy Smith (Not in Mondays)

Responses to comments on the interfaces; I still need to do a detailed review of ...

7 years, 1 month ago (2013-11-18 20:37:04 UTC) #29

Responses to comments on the interfaces; I still need to do a detailed review of
the implementation.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/14 02:54:19, rvargas wrote:
> We could add more methods to Address to make it aware of small tables (or add
> more arguments to the new methods), but between the two files I think it is
> better to leak from Address to Index than the other way. I could use
> Addr::FromEvictedAddress even for small tables after massaging the value, but
I
> don't think that's what you want.

It might be, actually.  I agree with you that given that we're going to leak
some information in one direction, I'd rather leak from Address to IndexTable
than the other way around (ideally I'd like to not leak at all, but I haven't
come up with any suggestions as to how to do that :-}).  But we still have two
different methods on address that look, from where I'm sitting, like they are
targeting the two different formats of address in IndexTable.  Looking just at
the addr.h header, it's not clear why we need both the Addr constructor and the
static methods Addr::From*Address().  So if you'd be comfortable either breaking
the block portion of address_value out and using the Addr() constructor in place
of From*Address (which means a bit more leakage of knowledge into IndexTable,
specifically how multi-file address values are split between file # and block #,
but we've already gestured at that in our use of the constructor) or the reverse
(which leaks the same information) I'd personally prefer that, as it'd make the
Addr interface not have any hints of index table formats in it.  

Alternatively, documentation in addr.h explaining what the From*Address() are
useful for without (explicitly or implicitly) being dependent on the IndexTable
format would be fine.  

And, finally, I'm not fanatic about this--if you really want to leave it this
way, I'm ok with that.  But it feels to me as if we're leaking information both
ways currently, and if we can I'd like to clean that up at least a little bit.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:262: return
Addr::FromEntryAddress(address_value);
On 2013/11/14 02:54:19, rvargas wrote:
> On 2013/11/13 21:14:12, rdsmith wrote:
> > For both these calls, it feels like "From*Address" is a bit of a misnomer,
> both
> > because "address" means several different things in this code, and because
the
> > address_value means different things in the small and large table formats. 
> > Maybe From*FileIndexAddress?
> 
> The name follows what is available on addr.h :( I can change it if you feel
> strongly about it, but the only things Addr knows about is BLOCK_ENTRIES and
> BLOCK_EVICTED, it knows nothing about the index. (of course, that is not
> completely true, as it knows what it means to grab value and put it into a
> meaningless Addr)
> 
> FromEntryFileAddress could work, but I don't like that it's not clear what
> "file" is doing there... as in that word sticks with Entry (correct), but also
> with Address (wrong)
> 
> Another option would be to not use "address" for the index, but I could not
come
> up with another term

Maybe a comment inside the "if (small_table_)" conditional saying "address_value
is just a block index in the small table format, but is a full address
(including file number) in the large table format."  As I understand it, the
misnomer here is that address_value isn't a full address (including file number)
in the small table format case.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr
address);
On 2013/11/14 02:54:19, rvargas wrote:
> On 2013/11/13 21:14:12, rdsmith wrote:
> > I tried to grok the different usages of CreateEntryCell() and
FindEntryCell()
> by
> > looking at the reference v3 CL, but I couldn't get it.  They look different
> > enough that my initial thought of "Would a single FindOrCreate interface
> work?"
> > almost certainly wasn't a good one.  But I'm not certain I understand why;
> could
> > you give me a sense as to how these will usually be used?  Specifically, it
> > seems like if you're doing a CreateEntryCell(), you've already done the hard
> > work if you've figured out the address, so it sounds like CreateEntryCell
> should
> > only be called from code that actually does allocation within the cache
(which
> > is the backend??).  FindEntryCell you can do when you have the address, but
> > where are you likely to have gotten the address from, if not the cache?
> 
> There should be only two callers for CreateEntryCell and that is the backend
> creating a new entry (after making sure that the entry doesn't exist), and
> creating an evicted entry record, when evicting a normal entry.
> 
> On the other hand, FindEntryCell is used to make sure that something that was
> valid some time ago is still valid right now (this is explained with the
> declaration of EntryCell).
> 
> Basically, after calling LookupEntries you end up with a bunch of records that
> exist right now (hopefully just one), but the caller now has to go open that
> file and read it, and that takes time. By the time that operation completes,
the
> caller should not use the same EntryCell right away, or at least not write to
> it, because it may have moved. So it has to grab the hash and address and do a
> quick look up to get the current location of the cell, and then it can write
to
> it.

So to reflect back, CreateEntryCell creates a new entry in the bucket chain for
the given hash given the address, and FindEntryCell confirms that there already
exists an entry in the bucket chain for the given hash that matches the address,
and returns it, returning a null EntryCell() if there is no such entry?  That
makes sense.  Would you be willing to add a note about the null EntryCell()
return case in the comment in the .h file, and possibly about the normal usage
for FindEntryCell() to be to confirm the existence of the given entry at that
location after an asynchronous operation?

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
On 2013/11/14 02:54:19, rvargas wrote:
> On 2013/11/13 21:14:12, rdsmith wrote:
> > Where is the boundary between low usage and high usage defined?
> 
> That's an implementation detail of the eviction algorithm (eviction_v3.cc)

So why is it exposed on a public interface?  (Not a question for this CL, more
of an overall design question.)   If it's exposed on a public interface, it
seems like it should have some sort of, maybe handwavy, definition for the
callers of that interface.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:211: // that this method does not return the
cells matching the provided timestamp,
On 2013/11/14 02:54:19, rvargas wrote:
> On 2013/11/13 21:14:12, rdsmith wrote:
> > nit(s): I'd leave out the "keep in mind" since I don't think that
information
> is
> 
> I meant it to be "as implied by the name" (otherwise it would be
GetCurrentCells
> or something like that) :) dropped.
> 
> > actually given anywhere else, change "the provided timestamp" to "the
> iterator's
> > timestamp", and change "(or before)" to "(or before, depending on the value
of
> 
> changed the wording slightly because I think it is important to keep the
> distinction between the value when it is called and on return.
> 
> > the iterator's forward member)".  With that, I think this documentation +
the
> > iterator documentation is good; are those changes ok with you?
> 

The revised doc looks good to me.

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc#newcode250 net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/18 20:37:04, rdsmith wrote: ...

7 years, 1 month ago (2013-11-18 23:43:33 UTC) #30

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/18 20:37:04, rdsmith wrote:
> On 2013/11/14 02:54:19, rvargas wrote:
> > We could add more methods to Address to make it aware of small tables (or
add
> > more arguments to the new methods), but between the two files I think it is
> > better to leak from Address to Index than the other way. I could use
> > Addr::FromEvictedAddress even for small tables after massaging the value,
but
> I
> > don't think that's what you want.
> 
> It might be, actually.  I agree with you that given that we're going to leak
> some information in one direction, I'd rather leak from Address to IndexTable
> than the other way around (ideally I'd like to not leak at all, but I haven't
> come up with any suggestions as to how to do that :-}).  But we still have two
> different methods on address that look, from where I'm sitting, like they are
> targeting the two different formats of address in IndexTable.  Looking just at
> the addr.h header, it's not clear why we need both the Addr constructor and
the
> static methods Addr::From*Address().  So if you'd be comfortable either
breaking
> the block portion of address_value out and using the Addr() constructor in
place
> of From*Address (which means a bit more leakage of knowledge into IndexTable,
> specifically how multi-file address values are split between file # and block
#,
> but we've already gestured at that in our use of the constructor) or the
reverse
> (which leaks the same information) I'd personally prefer that, as it'd make
the
> Addr interface not have any hints of index table formats in it.  
> 
> Alternatively, documentation in addr.h explaining what the From*Address() are
> useful for without (explicitly or implicitly) being dependent on the
IndexTable
> format would be fine.  
> 
> And, finally, I'm not fanatic about this--if you really want to leave it this
> way, I'm ok with that.  But it feels to me as if we're leaking information
both
> ways currently, and if we can I'd like to clean that up at least a little bit.

Done. I'll cleanup Addr if we are happy here.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr
address);
On 2013/11/18 20:37:04, rdsmith wrote:
> On 2013/11/14 02:54:19, rvargas wrote:
> > On 2013/11/13 21:14:12, rdsmith wrote:
> > > I tried to grok the different usages of CreateEntryCell() and
> FindEntryCell()
> > by
> > > looking at the reference v3 CL, but I couldn't get it.  They look
different
> > > enough that my initial thought of "Would a single FindOrCreate interface
> > work?"
> > > almost certainly wasn't a good one.  But I'm not certain I understand why;
> > could
> > > you give me a sense as to how these will usually be used?  Specifically,
it
> > > seems like if you're doing a CreateEntryCell(), you've already done the
hard
> > > work if you've figured out the address, so it sounds like CreateEntryCell
> > should
> > > only be called from code that actually does allocation within the cache
> (which
> > > is the backend??).  FindEntryCell you can do when you have the address,
but
> > > where are you likely to have gotten the address from, if not the cache?
> > 
> > There should be only two callers for CreateEntryCell and that is the backend
> > creating a new entry (after making sure that the entry doesn't exist), and
> > creating an evicted entry record, when evicting a normal entry.
> > 
> > On the other hand, FindEntryCell is used to make sure that something that
was
> > valid some time ago is still valid right now (this is explained with the
> > declaration of EntryCell).
> > 
> > Basically, after calling LookupEntries you end up with a bunch of records
that
> > exist right now (hopefully just one), but the caller now has to go open that
> > file and read it, and that takes time. By the time that operation completes,
> the
> > caller should not use the same EntryCell right away, or at least not write
to
> > it, because it may have moved. So it has to grab the hash and address and do
a
> > quick look up to get the current location of the cell, and then it can write
> to
> > it.
> 
> So to reflect back, CreateEntryCell creates a new entry in the bucket chain
for
> the given hash given the address, and FindEntryCell confirms that there
already
> exists an entry in the bucket chain for the given hash that matches the
address,
> and returns it, returning a null EntryCell() if there is no such entry?  That
> makes sense.  Would you be willing to add a note about the null EntryCell()
> return case in the comment in the .h file, and possibly about the normal usage
> for FindEntryCell() to be to confirm the existence of the given entry at that
> location after an asynchronous operation?
> 

I had already modified the comment to reflect use (not sure if you saw that).
Added the point about return on failure.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
On 2013/11/18 20:37:04, rdsmith wrote:
> On 2013/11/14 02:54:19, rvargas wrote:
> > On 2013/11/13 21:14:12, rdsmith wrote:
> > > Where is the boundary between low usage and high usage defined?
> > 
> > That's an implementation detail of the eviction algorithm (eviction_v3.cc)
> 
> So why is it exposed on a public interface?  (Not a question for this CL, more
> of an overall design question.)   If it's exposed on a public interface, it
> seems like it should have some sort of, maybe handwavy, definition for the
> callers of that interface.

I'm not sure I follow. It is not defined in an interface, but clearly the caller
should be able to get the data out of the table. What is defined is that each
entry has a use counter, a fetch counter and a group. (cells don't have fetch
counter, but have a group). The caller is able to set the group and counter at
will, retrieve it from a particular cell, and get all cells from all groups
given a time constrain (this method). The mapping of counters to groups is the
detail that is private to the eviction code.

The only detail about the arguments is that the method doesn't return evicted
entries, in part because I have not written the code to delete evicted entries
yet. In that sense, this method provides strictly what the caller needs so it's
not really general purpose, but I think that's fine.

Randy Smith (Not in Mondays)

Part-way through reviewing Init(), and getting somewhat bogged down in understanding the use of hash ...

7 years ago (2013-11-25 19:48:07 UTC) #31

Part-way through reviewing Init(), and getting somewhat bogged down in
understanding the use of hash values, so pausing to provide these comments and
ask for clarifications on hash.  I believe that:
* The hash for a particular key is 32 bits, generated by base::Hash(key)
* The bottom bits (a variable number of bits, depending on table_len/mask_) are
used as the address of the relevant bucket for that entry.
* The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell and
compared on cell lookup.  That shift means that we're comparing just the number
of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is
dependent on the number of bits stored for the hash in the different types of
table/cells.
* The top and bottom bits in this scheme may overlap.
* When we increase the table size via doubling (i.e. increase the number of
buckets), that may mean that the extra bit (or bits, if we haven't touched a
cell from previous doublings) used in indexing into the table may be wrong, and
the cell may need to be moved.  This is what the MisplacedCell() logic is about.

That all correct?  

I feel as if both address and hash could benefit from a comment somewhere
indicating that they mean different things in different parts of the code, with
an explanation as to what those differences are (if this is in the design doc,
that's fine, and my apologies for forgetting it).

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/18 23:43:34, rvargas wrote:
> On 2013/11/18 20:37:04, rdsmith wrote:
> > On 2013/11/14 02:54:19, rvargas wrote:
> > > We could add more methods to Address to make it aware of small tables (or
> add
> > > more arguments to the new methods), but between the two files I think it
is
> > > better to leak from Address to Index than the other way. I could use
> > > Addr::FromEvictedAddress even for small tables after massaging the value,
> but
> > I
> > > don't think that's what you want.
> > 
> > It might be, actually.  I agree with you that given that we're going to leak
> > some information in one direction, I'd rather leak from Address to
IndexTable
> > than the other way around (ideally I'd like to not leak at all, but I
haven't
> > come up with any suggestions as to how to do that :-}).  But we still have
two
> > different methods on address that look, from where I'm sitting, like they
are
> > targeting the two different formats of address in IndexTable.  Looking just
at
> > the addr.h header, it's not clear why we need both the Addr constructor and
> the
> > static methods Addr::From*Address().  So if you'd be comfortable either
> breaking
> > the block portion of address_value out and using the Addr() constructor in
> place
> > of From*Address (which means a bit more leakage of knowledge into
IndexTable,
> > specifically how multi-file address values are split between file # and
block
> #,
> > but we've already gestured at that in our use of the constructor) or the
> reverse
> > (which leaks the same information) I'd personally prefer that, as it'd make
> the
> > Addr interface not have any hints of index table formats in it.  
> > 
> > Alternatively, documentation in addr.h explaining what the From*Address()
are
> > useful for without (explicitly or implicitly) being dependent on the
> IndexTable
> > format would be fine.  
> > 
> > And, finally, I'm not fanatic about this--if you really want to leave it
this
> > way, I'm ok with that.  But it feels to me as if we're leaking information
> both
> > ways currently, and if we can I'd like to clean that up at least a little
bit.
> 
> Done. I'll cleanup Addr if we are happy here.

Thanks!  Yes, this looks much better.  Go ahead and cleanup Addr.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:186: EntryCell FindEntryCell(uint32 hash, Addr
address);
On 2013/11/18 23:43:34, rvargas wrote:
> On 2013/11/18 20:37:04, rdsmith wrote:
> > On 2013/11/14 02:54:19, rvargas wrote:
> > > On 2013/11/13 21:14:12, rdsmith wrote:
> > > > I tried to grok the different usages of CreateEntryCell() and
> > FindEntryCell()
> > > by
> > > > looking at the reference v3 CL, but I couldn't get it.  They look
> different
> > > > enough that my initial thought of "Would a single FindOrCreate interface
> > > work?"
> > > > almost certainly wasn't a good one.  But I'm not certain I understand
why;
> > > could
> > > > you give me a sense as to how these will usually be used?  Specifically,
> it
> > > > seems like if you're doing a CreateEntryCell(), you've already done the
> hard
> > > > work if you've figured out the address, so it sounds like
CreateEntryCell
> > > should
> > > > only be called from code that actually does allocation within the cache
> > (which
> > > > is the backend??).  FindEntryCell you can do when you have the address,
> but
> > > > where are you likely to have gotten the address from, if not the cache?
> > > 
> > > There should be only two callers for CreateEntryCell and that is the
backend
> > > creating a new entry (after making sure that the entry doesn't exist), and
> > > creating an evicted entry record, when evicting a normal entry.
> > > 
> > > On the other hand, FindEntryCell is used to make sure that something that
> was
> > > valid some time ago is still valid right now (this is explained with the
> > > declaration of EntryCell).
> > > 
> > > Basically, after calling LookupEntries you end up with a bunch of records
> that
> > > exist right now (hopefully just one), but the caller now has to go open
that
> > > file and read it, and that takes time. By the time that operation
completes,
> > the
> > > caller should not use the same EntryCell right away, or at least not write
> to
> > > it, because it may have moved. So it has to grab the hash and address and
do
> a
> > > quick look up to get the current location of the cell, and then it can
write
> > to
> > > it.
> > 
> > So to reflect back, CreateEntryCell creates a new entry in the bucket chain
> for
> > the given hash given the address, and FindEntryCell confirms that there
> already
> > exists an entry in the bucket chain for the given hash that matches the
> address,
> > and returns it, returning a null EntryCell() if there is no such entry? 
That
> > makes sense.  Would you be willing to add a note about the null EntryCell()
> > return case in the comment in the .h file, and possibly about the normal
usage
> > for FindEntryCell() to be to confirm the existence of the given entry at
that
> > location after an asynchronous operation?
> > 
> 
> I had already modified the comment to reflect use (not sure if you saw that).
> Added the point about return on failure.

Yeah, sorry, missed the mod.  This looks good.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
On 2013/11/18 23:43:34, rvargas wrote:
> On 2013/11/18 20:37:04, rdsmith wrote:
> > On 2013/11/14 02:54:19, rvargas wrote:
> > > On 2013/11/13 21:14:12, rdsmith wrote:
> > > > Where is the boundary between low usage and high usage defined?
> > > 
> > > That's an implementation detail of the eviction algorithm (eviction_v3.cc)
> > 
> > So why is it exposed on a public interface?  (Not a question for this CL,
more
> > of an overall design question.)   If it's exposed on a public interface, it
> > seems like it should have some sort of, maybe handwavy, definition for the
> > callers of that interface.
> 
> I'm not sure I follow. It is not defined in an interface, but clearly the
caller
> should be able to get the data out of the table. What is defined is that each
> entry has a use counter, a fetch counter and a group. (cells don't have fetch
> counter, but have a group). The caller is able to set the group and counter at
> will, retrieve it from a particular cell, and get all cells from all groups
> given a time constrain (this method). The mapping of counters to groups is the
> detail that is private to the eviction code.

Ah, ok, I think I follow.  The category that an entry is in can be set
externally to that entry, but is limited to values in the EntryGroup enum.  The
eviction code is the only code that sets those values, and is the only consumer
of this interface.  So there's a way in which this interface is hiding consumer
information from the producer, not the other way around.

Just looking at this interface in isolation, I'd value having a comment
reference indicating that the iterator into which individual entries are placed
depends in the EntryGroup value associated with them.  But I could easily
imagine that anyone who was doing any extended work in the cache would recognize
that *_use referred to EntryGroup, and hence wouldn't need the reference.  So
I'll raise that as an idea and leave it up to you whether to take it or not.

> The only detail about the arguments is that the method doesn't return evicted
> entries, in part because I have not written the code to delete evicted entries
> yet. In that sense, this method provides strictly what the caller needs so
it's
> not really general purpose, but I think that's fine.

Yeah, I agree.  It's also very much targeted at the eviction code--no other
caller will really have any use for it.  In an ideal world, there'd be a private
pipe for that information from IndexTable->eviction code so that other consumers
wouldn't be distracted by/have access to the interface, but I think that's way
more trouble than it's worth.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0);
Does this mean we always have to incrementally grow the extra table in between
doublings of the main table?  Why put that restriction on it?

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size
* sizeof(IndexBucket));
It feels to me like there are assumptions in this code about aliasing, but I'm
not certain what they are.   Let me go over what I believe the context for this
code is and make comments conditional on that.  

I believe that this function will be called in only three situations:
* First initialization.  main_table_ will be null, params->main_table will point
to memory mapped from a file, and params->extra_table will be null.
* Growth of extra table.  params->main_table will be NULL.   params->extra_table
!= extra_table_, but will point to a memory mapped region of the same file, only
larger.  Thus any changes made to params->extra_table will be reflected in
extra_table_ and vice versa (for the length that extra_table_ is mapped for). 
extra_table_ will be unmapped after this function returns (does the calling code
have it's own pointer to extra_table_?)
* Doubling of main table.  params->main_table != main_table_, but will point to
a memory mapped region of the same file, only doubled in size. 
params->extra_table will point to a section of memory mapped to the extra table
file, that hasn't changed in size any.  params->extra_table != extra_table_ and
the location pointed to by extra_table_ will be unmapped after this routine
finishes (??), but the two memory regions will be direct aliases to one another.

(I'll probably want a comment clarifying these assumptions somewhere, but I'll
wait to suggest what until I'm certain I understand it all. :-})

Assuming that that's correct, why refer to the same data through two different
points (extra_table_ and params->extra_table)?  It seems like it's just going to
generate confusion.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
What results in this 11?  This seems like you're simply saying that you can't
have a table_len greater than 32 bits, in which case don't you want to just say
that?  (I.e. IIRTCC the 11 is dependent on a specific value of kBaseTableLen,
and that dependency should be explicit in the DCHECK).

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift -
kHashSmallTableShift;
In other words, while the number of bits that are thrown away before making the
comparison between the hash and the IndexCell partial hash is greater than the
number of bits used to index into the table, leave the table small, but when we
start using all those thrown-away bits and crowding into the comparison section,
switch to the large table?  

(I'll note that this seems like a different metric for the small/large table
distinction that I'd previously understood--I thought it was about whether or
not the entries will always be stored in one file.  But looking at
disk_format_v3.h, it seems like the distinction is specified purely by the
number of entries.  If there's a subtle background connection between these
three things, it should probably be made explicit, or if one of them isn't
relevant, that should be made clear too.  

Based on how much pain it took me to figure out, I'll probably also want an
explanatory comment about what hashes mean (in the different contexts)
somewhere, but I'll ask for that once I'm confident I actually understand how
they're used.)

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:485: // creating the bitmaps, clear the part of
the extra table.
nit, suggestion: "clear the part of the bitmap referring to the extra table"?

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:502: DCHECK_GE(num_words, old_num_words);
Under what circumstances will they be equal?  Does that imply that we could be
doubling table size in a situation where the extra_table is the same size as the
added main table size?

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:264: int extra_bits_;  // How many bits are in
mask_ above the default value.
I could certainly be reading the code incorrectly, but this comment seems
incorrect to me.  From IndexTable::Init:

  extra_bits_ = base::bits::Log2Floor(header_->table_len) -
                base::bits::Log2Floor(kBaseTableLen);
  DCHECK_GE(extra_bits_, 0);
  DCHECK_LE(extra_bits_, 11);
  mask_ = ((kBaseTableLen / kCellsPerBucket) << extra_bits_) - 1;   // <<--
  small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift;
  if (!small_table_)
    extra_bits_ -= kHashShift - kHashSmallTableShift;

At the marked line, the relationship in the comment seems accurate to me.  But
if we don't have a small table, we reduce extra_bits_ by the different in hash
shift between small and large tables, which would seem to make the comment
invalid.  What am I missing?

rvargas (doing something else)

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc#newcode454 net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0); On 2013/11/25 19:48:07, rdsmith wrote: > Does ...

7 years ago (2013-11-26 00:32:41 UTC) #32

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0);
On 2013/11/25 19:48:07, rdsmith wrote:
> Does this mean we always have to incrementally grow the extra table in between
> doublings of the main table?  Why put that restriction on it?

Nope. It can be zero.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size
* sizeof(IndexBucket));
On 2013/11/25 19:48:07, rdsmith wrote:
> It feels to me like there are assumptions in this code about aliasing, but I'm
> not certain what they are.   Let me go over what I believe the context for
this
> code is and make comments conditional on that.  
> 
> I believe that this function will be called in only three situations:
> * First initialization.  main_table_ will be null, params->main_table will
point
> to memory mapped from a file, and params->extra_table will be null.

params->extra_table should not be null.

> * Growth of extra table.  params->main_table will be NULL.  
params->extra_table
> != extra_table_, but will point to a memory mapped region of the same file,
only
> larger.  Thus any changes made to params->extra_table will be reflected in
> extra_table_ and vice versa (for the length that extra_table_ is mapped for). 
> extra_table_ will be unmapped after this function returns (does the calling
code
> have it's own pointer to extra_table_?)

Yes, someone will know how to unmap the old table after this method returns.

> * Doubling of main table.  params->main_table != main_table_, but will point
to
> a memory mapped region of the same file, only doubled in size. 
> params->extra_table will point to a section of memory mapped to the extra
table
> file, that hasn't changed in size any.  params->extra_table != extra_table_
and
> the location pointed to by extra_table_ will be unmapped after this routine
> finishes (??), but the two memory regions will be direct aliases to one
another.

sounds correct.

The underlying assumption is that the index is memory mapped and follows the
format (and files) described in disk_format_v3.h, so the act of re-initializing
the IndexTable is a consequence of growing some file and creating another map of
the new extension. As such, we should never receive the same pointer that we are
currently using (but there's no check for that, because it shouldn't really
matter), and we deal with pointers not mapped sections (in fact all the unit
tests just allocate memory, never map anything).

Aliasing is important in that we don't memcpy data from the old mapping to the
new mapping when the file just grows, as that would be a waste of time. So yes,
this has a memory interface, but expects the caller to deal with the files as
defined by the file format, not just random pointers (hence
TestCacheTables::CopyFrom simulating that behavior)

But that should be clear from the comment at the top of index_table.h, right?

> (I'll probably want a comment clarifying these assumptions somewhere, but I'll
> wait to suggest what until I'm certain I understand it all. :-})
> 
> Assuming that that's correct, why refer to the same data through two different
> points (extra_table_ and params->extra_table)?  It seems like it's just going
to
> generate confusion.  

Do I? I try to refer to the data that logically is being accessed... so for
instance line 461 copies the old data somewhere safe and line 463 clears the
storage for the new data. Production code could copy the new data and clear the
old data, but that would confuse the reader and prevent the unit tests from
working :)

I mean, the whole thing requires mapping (or a judicious emulation from the
caller), but if this code is misusing source or destination somewhere assuming
that they are the same, let me know so that it's fixed.

So, conceptually, the caller could suspend all requests to this class, copy the
file to another file and map the new file (so that there is no aliasing) before
calling into this class again asking for re-init.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
On 2013/11/25 19:48:07, rdsmith wrote:
> What results in this 11?  This seems like you're simply saying that you can't
> have a table_len greater than 32 bits, in which case don't you want to just
say
> that?  (I.e. IIRTCC the 11 is dependent on a specific value of kBaseTableLen,
> and that dependency should be explicit in the DCHECK).

This is just saying that the table can be doubled 11 times from the smaller
table  to the biggest table. Yes, the value (11) is derived from kBaseTableLen
and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to have
the number of bits here (right when we are saying "how many extra bits can we
have") rather to come out with a probably cryptic formula that derives the
proper value from other numbers. The cost is of course less flexibility if the
format change, but at least it will fail right when it is needed. When I was
writing the code this value changed right here multiple times depending on what
I had implemented (and was ready to test) so it seemed better to keep it that
way instead of tying it to other constants.

I can define a local constant here if you don't like it.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift -
kHashSmallTableShift;
On 2013/11/25 19:48:07, rdsmith wrote:
> In other words, while the number of bits that are thrown away before making
the
> comparison between the hash and the IndexCell partial hash is greater than the
> number of bits used to index into the table, leave the table small, but when
we
> start using all those thrown-away bits and crowding into the comparison
section,
> switch to the large table?

I didn't intend to make a strong statement about what this means. It really just
means extra_bits_ < 6, as in the table can be doubled 6 times before reaching
the 64k border between the two formats.

It's just that those two values are tightly related to each other, and the only
difference between them is the table type (small or large), and subtracting them
gives me the 6 that I need. Somewhat similar to the previous '11' but this time
I'm tying it to other constants that change accordingly (and as far as I
remember I use this same formula in other places).

> 
> (I'll note that this seems like a different metric for the small/large table
> distinction that I'd previously understood--I thought it was about whether or
> not the entries will always be stored in one file.  But looking at
> disk_format_v3.h, it seems like the distinction is specified purely by the
> number of entries.  If there's a subtle background connection between these
> three things, it should probably be made explicit, or if one of them isn't
> relevant, that should be made clear too.

The distinction between small and large is basically arbitrary. 64K is an
interesting number, and it is the base table size of V2, so it was just natural
to keep using it. A small table is a new feature of V3, so it covers tables
smaller than 64K.

The maximum number of blocks on a block file also happens to be 64k mostly
because 64k is an interesting number (and it doesn't yield files that are too
big with the largest block file), and it requires two standard pages (4k each)
to hold the bitmap. 

That results in being able to store up to 64k entries on a single file (in fact,
slightly less than that). And that just fits in that it allows the implied file
feature of small tables, but it is not the cause of the border (although it
would be hard to move that border somewhere else).

So things fit, but the "real" distinction between small and large tables is
arbitrarily a 64k table. That's why some numbers are really "count how many
times we can do foo", given the arbitrary limits of the format (start at 1k,
switch at 64k, up to 22 bits)

> 
> Based on how much pain it took me to figure out, I'll probably also want an
> explanatory comment about what hashes mean (in the different contexts)
> somewhere, but I'll ask for that once I'm confident I actually understand how
> they're used.)

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:502: DCHECK_GE(num_words, old_num_words);
On 2013/11/25 19:48:07, rdsmith wrote:
> Under what circumstances will they be equal?  Does that imply that we could be
> doubling table size in a situation where the extra_table is the same size as
the
> added main table size?

Mostly a check for not getting a negative index below.

We could be doing that, or maybe growing the extra table by 16 entries (so that
the number of words doesn't change). Backend tests grow the table really slowly
:)

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:264: int extra_bits_;  // How many bits are in
mask_ above the default value.
On 2013/11/25 19:48:07, rdsmith wrote:
> I could certainly be reading the code incorrectly, but this comment seems
> incorrect to me.  From IndexTable::Init:
> 
>   extra_bits_ = base::bits::Log2Floor(header_->table_len) -
>                 base::bits::Log2Floor(kBaseTableLen);
>   DCHECK_GE(extra_bits_, 0);
>   DCHECK_LE(extra_bits_, 11);
>   mask_ = ((kBaseTableLen / kCellsPerBucket) << extra_bits_) - 1;   // <<--
>   small_table_ = extra_bits_ < kHashShift - kHashSmallTableShift;
>   if (!small_table_)
>     extra_bits_ -= kHashShift - kHashSmallTableShift;
> 
> At the marked line, the relationship in the comment seems accurate to me.  But
> if we don't have a small table, we reduce extra_bits_ by the different in hash
> shift between small and large tables, which would seem to make the comment
> invalid.  What am I missing?

nothing :). I could say that "default value" was not defined, and there is a
default value for small table and another one for normal tables (which is what
happens given that extra_bits is effectively applied against small or normal
shifts depending on the type of table)

rvargas (doing something else)

On 2013/11/25 19:48:07, rdsmith wrote: > Part-way through reviewing Init(), and getting somewhat bogged down ...

7 years ago (2013-11-26 00:48:56 UTC) #33

On 2013/11/25 19:48:07, rdsmith wrote:
> Part-way through reviewing Init(), and getting somewhat bogged down in
> understanding the use of hash values, so pausing to provide these comments and
> ask for clarifications on hash.  I believe that:
> * The hash for a particular key is 32 bits, generated by base::Hash(key)
> * The bottom bits (a variable number of bits, depending on table_len/mask_)
are
> used as the address of the relevant bucket for that entry.
> * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell and
> compared on cell lookup.  That shift means that we're comparing just the
number
> of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is
> dependent on the number of bits stored for the hash in the different types of
> table/cells.
> * The top and bottom bits in this scheme may overlap.
> * When we increase the table size via doubling (i.e. increase the number of
> buckets), that may mean that the extra bit (or bits, if we haven't touched a
> cell from previous doublings) used in indexing into the table may be wrong,
and
> the cell may need to be moved.  This is what the MisplacedCell() logic is
about.

Roughly yes. As the table grows, bits move from the "we need to store this part"
to the "this part is implied given the position in the table". There is no need
to store store less and less bits each time (and complicate all the logic) so
that means that each time we double the size, there's another bit that becomes
redundant.

When we double the table, we move all cells that have to moved to the upper half
of the table (about half the cells should move). If that goes well, there will
never be a misplaced cell. However, if there is a crash while we are moving
cells (even though this is a memory-only operation), we could end up with some
cells on the right place and some cells in the wrong place... those misplaced
cells are detected by a mismatch on the redundant bits, and can be moved as we
see them.

>  
> 
> That all correct?  
> 
> I feel as if both address and hash could benefit from a comment somewhere
> indicating that they mean different things in different parts of the code,
with
> an explanation as to what those differences are (if this is in the design doc,
> that's fine, and my apologies for forgetting it).

How the hash is used to locate entries on the table (and doubling of the table)
is covered on the design doc. But I don't fully follow what you're thinking.

thanks.

rvargas (doing something else)

Updated version https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index_table.cc#newcode250 net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const { On 2013/11/25 19:48:07, ...

7 years ago (2013-11-26 19:54:40 UTC) #34

Updated version

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:250: Addr EntryCell::GetAddress() const {
On 2013/11/25 19:48:07, rdsmith wrote:
> On 2013/11/18 23:43:34, rvargas wrote:
> > On 2013/11/18 20:37:04, rdsmith wrote:
> > > On 2013/11/14 02:54:19, rvargas wrote:
> > > > We could add more methods to Address to make it aware of small tables
(or
> > add
> > > > more arguments to the new methods), but between the two files I think it
> is
> > > > better to leak from Address to Index than the other way. I could use
> > > > Addr::FromEvictedAddress even for small tables after massaging the
value,
> > but
> > > I
> > > > don't think that's what you want.
> > > 
> > > It might be, actually.  I agree with you that given that we're going to
leak
> > > some information in one direction, I'd rather leak from Address to
> IndexTable
> > > than the other way around (ideally I'd like to not leak at all, but I
> haven't
> > > come up with any suggestions as to how to do that :-}).  But we still have
> two
> > > different methods on address that look, from where I'm sitting, like they
> are
> > > targeting the two different formats of address in IndexTable.  Looking
just
> at
> > > the addr.h header, it's not clear why we need both the Addr constructor
and
> > the
> > > static methods Addr::From*Address().  So if you'd be comfortable either
> > breaking
> > > the block portion of address_value out and using the Addr() constructor in
> > place
> > > of From*Address (which means a bit more leakage of knowledge into
> IndexTable,
> > > specifically how multi-file address values are split between file # and
> block
> > #,
> > > but we've already gestured at that in our use of the constructor) or the
> > reverse
> > > (which leaks the same information) I'd personally prefer that, as it'd
make
> > the
> > > Addr interface not have any hints of index table formats in it.  
> > > 
> > > Alternatively, documentation in addr.h explaining what the From*Address()
> are
> > > useful for without (explicitly or implicitly) being dependent on the
> > IndexTable
> > > format would be fine.  
> > > 
> > > And, finally, I'm not fanatic about this--if you really want to leave it
> this
> > > way, I'm ok with that.  But it feels to me as if we're leaking information
> > both
> > > ways currently, and if we can I'd like to clean that up at least a little
> bit.
> > 
> > Done. I'll cleanup Addr if we are happy here.
> 
> Thanks!  Yes, this looks much better.  Go ahead and cleanup Addr.

Done.

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1530001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:205: // for the provided iterators are ignored.
On 2013/11/25 19:48:07, rdsmith wrote:
> On 2013/11/18 23:43:34, rvargas wrote:
> > On 2013/11/18 20:37:04, rdsmith wrote:
> > > On 2013/11/14 02:54:19, rvargas wrote:
> > > > On 2013/11/13 21:14:12, rdsmith wrote:
> > > > > Where is the boundary between low usage and high usage defined?
> > > > 
> > > > That's an implementation detail of the eviction algorithm
(eviction_v3.cc)
> > > 
> > > So why is it exposed on a public interface?  (Not a question for this CL,
> more
> > > of an overall design question.)   If it's exposed on a public interface,
it
> > > seems like it should have some sort of, maybe handwavy, definition for the
> > > callers of that interface.
> > 
> > I'm not sure I follow. It is not defined in an interface, but clearly the
> caller
> > should be able to get the data out of the table. What is defined is that
each
> > entry has a use counter, a fetch counter and a group. (cells don't have
fetch
> > counter, but have a group). The caller is able to set the group and counter
at
> > will, retrieve it from a particular cell, and get all cells from all groups
> > given a time constrain (this method). The mapping of counters to groups is
the
> > detail that is private to the eviction code.
> 
> Ah, ok, I think I follow.  The category that an entry is in can be set
> externally to that entry, but is limited to values in the EntryGroup enum. 
The
> eviction code is the only code that sets those values, and is the only
consumer
> of this interface.  So there's a way in which this interface is hiding
consumer
> information from the producer, not the other way around.
> 
> Just looking at this interface in isolation, I'd value having a comment
> reference indicating that the iterator into which individual entries are
placed
> depends in the EntryGroup value associated with them.  But I could easily
> imagine that anyone who was doing any extended work in the cache would
recognize
> that *_use referred to EntryGroup, and hence wouldn't need the reference.  So
> I'll raise that as an idea and leave it up to you whether to take it or not.

done
> 
> > The only detail about the arguments is that the method doesn't return
evicted
> > entries, in part because I have not written the code to delete evicted
entries
> > yet. In that sense, this method provides strictly what the caller needs so
> it's
> > not really general purpose, but I think that's fine.
> 
> Yeah, I agree.  It's also very much targeted at the eviction code--no other
> caller will really have any use for it.  In an ideal world, there'd be a
private
> pipe for that information from IndexTable->eviction code so that other
consumers
> wouldn't be distracted by/have access to the interface, but I think that's way
> more trouble than it's worth.

Randy Smith (Not in Mondays)

On 2013/11/26 00:48:56, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > Part-way through ...

7 years ago (2013-12-02 18:53:42 UTC) #35

On 2013/11/26 00:48:56, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > Part-way through reviewing Init(), and getting somewhat bogged down in
> > understanding the use of hash values, so pausing to provide these comments
and
> > ask for clarifications on hash.  I believe that:
> > * The hash for a particular key is 32 bits, generated by base::Hash(key)
> > * The bottom bits (a variable number of bits, depending on table_len/mask_)
> are
> > used as the address of the relevant bucket for that entry.
> > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell
and
> > compared on cell lookup.  That shift means that we're comparing just the
> number
> > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is
> > dependent on the number of bits stored for the hash in the different types
of
> > table/cells.
> > * The top and bottom bits in this scheme may overlap.
> > * When we increase the table size via doubling (i.e. increase the number of
> > buckets), that may mean that the extra bit (or bits, if we haven't touched a
> > cell from previous doublings) used in indexing into the table may be wrong,
> and
> > the cell may need to be moved.  This is what the MisplacedCell() logic is
> about.
> 
> Roughly yes. As the table grows, bits move from the "we need to store this
part"
> to the "this part is implied given the position in the table". There is no
need
> to store store less and less bits each time (and complicate all the logic) so
> that means that each time we double the size, there's another bit that becomes
> redundant.
> 
> When we double the table, we move all cells that have to moved to the upper
half
> of the table (about half the cells should move). If that goes well, there will
> never be a misplaced cell. However, if there is a crash while we are moving
> cells (even though this is a memory-only operation), we could end up with some
> cells on the right place and some cells in the wrong place... those misplaced
> cells are detected by a mismatch on the redundant bits, and can be moved as we
> see them.

That makes sense.  Where is the logic that moves the cells on table doubling? 
Scanning Init(), I only see a call to MoveCells for the old_extra_table cells.  

Two flags get raised in my brain around crash handling (== unlikely but not
impossible corner case) code.  The first is that it needs to be tested
thoroughly, because our chances of noticing problems in it from more general
testing and the field is lower.  Quickly skimming the unit test, I'm not sure I
see anything that will result in exercising the
MisplacedHash/HandleMisplacedHash logic; am I missing it?   

The second is that code to handle rare corner cases can sometimes result in
masking other bugs.  It'd be nice if there was some way to disable the misplaced
hash logic for the tests that aren't trying to emulate the crashing code.   

WDYT?

(Still working on review; responding to this now just because it's a top-level
comment and thus easiest to respond to directly.)

> 
> >  
> > 
> > That all correct?  
> > 
> > I feel as if both address and hash could benefit from a comment somewhere
> > indicating that they mean different things in different parts of the code,
> with
> > an explanation as to what those differences are (if this is in the design
doc,
> > that's fine, and my apologies for forgetting it).
> 
> How the hash is used to locate entries on the table (and doubling of the
table)
> is covered on the design doc. But I don't fully follow what you're thinking.
> 
> thanks.

Randy Smith (Not in Mondays)

On 2013/11/26 00:48:56, rvargas wrote: > On 2013/11/25 19:48:07, rdsmith wrote: > > I feel ...

7 years ago (2013-12-02 19:28:01 UTC) #36

On 2013/11/26 00:48:56, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > I feel as if both address and hash could benefit from a comment somewhere
> > indicating that they mean different things in different parts of the code,
> with
> > an explanation as to what those differences are (if this is in the design
doc,
> > that's fine, and my apologies for forgetting it).
> 
> How the hash is used to locate entries on the table (and doubling of the
table)
> is covered on the design doc. But I don't fully follow what you're thinking.
> 
> thanks.

So the design document does cover what the hash is and how it's used to lookup
entries (and thank you--it was useful for me to go back to the design doc.  I
think I get more from it each time I go over it after reading the code, not
surprisingly).  But my concern is more that the concepts "hash" and "address"
mean different things at different places in the code, and I think it's easy for
the reader to assume that they mean the same thing, and be confused that way. 
If they were consistently referred to by different names in the code, that would
address my concern.  I'm not sure if this should be a high priority--if the
concepts are clear, it may be that a code reader will be able to track by
context which meaning of hash/address is currently relevant.  But that's the
concern I was raising. 

With regard to addresses, I think my confusion is between what I'll call "full
addresses" (i.e. what Addr() passes around), and "address values" (what you get
from IndexCells and EntryCell::GetAddressValue()).  I think it would be clearer
if those things used different names consistently.  Does this make sense to you?
 I think that would mean:
  * Changing the names of {Get,Set}Cell{SmallTable,}Address &
    {FileNumber,StartBlock}FromAddress, along with the variables used
    inside those functions?  
  * I think based on its usage in IsValidAddress, kMaxAddress would also
    become kMaxAddressValue.

With regard to hash, similarly there's the portion of the hash that's stored in
IndexCell (call it "high hash"), the portion that's implied by the position of
the IndexCell within the IndexTable ("low hash"), and the whole thing (gotten
from base::Hash() and split up as needed, "full hash").  It looks to me as if
what I'm calling the low hash is generally referred to as a bucket hash, so it
may be the only real confusion is between high hash and full hash (for example,
the EntryCell constructor takes a full hash and passes a high hash to
SetCellHash, storing the full hash in hash_).  I'm not sure what would be
involved in clearing this up or if it's worth it, but I suspect that just using
different words for the high hash and the full hash (whatever they are) would
reduce confusion in code readers.

rvargas (doing something else)

On 2013/12/02 18:53:42, rdsmith wrote: > On 2013/11/26 00:48:56, rvargas wrote: > > On 2013/11/25 ...

7 years ago (2013-12-02 21:01:49 UTC) #37

On 2013/12/02 18:53:42, rdsmith wrote:
> On 2013/11/26 00:48:56, rvargas wrote:
> > On 2013/11/25 19:48:07, rdsmith wrote:
> > > Part-way through reviewing Init(), and getting somewhat bogged down in
> > > understanding the use of hash values, so pausing to provide these comments
> and
> > > ask for clarifications on hash.  I believe that:
> > > * The hash for a particular key is 32 bits, generated by base::Hash(key)
> > > * The bottom bits (a variable number of bits, depending on
table_len/mask_)
> > are
> > > used as the address of the relevant bucket for that entry.
> > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the IndexCell
> and
> > > compared on cell lookup.  That shift means that we're comparing just the
> > number
> > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift is
> > > dependent on the number of bits stored for the hash in the different types
> of
> > > table/cells.
> > > * The top and bottom bits in this scheme may overlap.
> > > * When we increase the table size via doubling (i.e. increase the number
of
> > > buckets), that may mean that the extra bit (or bits, if we haven't touched
a
> > > cell from previous doublings) used in indexing into the table may be
wrong,
> > and
> > > the cell may need to be moved.  This is what the MisplacedCell() logic is
> > about.
> > 
> > Roughly yes. As the table grows, bits move from the "we need to store this
> part"
> > to the "this part is implied given the position in the table". There is no
> need
> > to store store less and less bits each time (and complicate all the logic)
so
> > that means that each time we double the size, there's another bit that
becomes
> > redundant.
> > 
> > When we double the table, we move all cells that have to moved to the upper
> half
> > of the table (about half the cells should move). If that goes well, there
will
> > never be a misplaced cell. However, if there is a crash while we are moving
> > cells (even though this is a memory-only operation), we could end up with
some
> > cells on the right place and some cells in the wrong place... those
misplaced
> > cells are detected by a mismatch on the redundant bits, and can be moved as
we
> > see them.
> 
> That makes sense.  Where is the logic that moves the cells on table doubling? 
> Scanning Init(), I only see a call to MoveCells for the old_extra_table cells.
 

At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes
through a loop and calls MoveSingleCell() when needed... note that the function
goes through main_table_, even though it needs the caller to provide a pointer
to the old_extra_table.


> 
> Two flags get raised in my brain around crash handling (== unlikely but not
> impossible corner case) code.  The first is that it needs to be tested
> thoroughly, because our chances of noticing problems in it from more general
> testing and the field is lower.  Quickly skimming the unit test, I'm not sure
I
> see anything that will result in exercising the
> MisplacedHash/HandleMisplacedHash logic; am I missing it?   
> 
> The second is that code to handle rare corner cases can sometimes result in
> masking other bugs.  It'd be nice if there was some way to disable the
misplaced
> hash logic for the tests that aren't trying to emulate the crashing code.   
> 

The tests are not going through inconsistent states or checks at this point...
some of the checks performed will crash the test if they fail (because most
tests run without a backend)... and the tests themselves are not generating bad
data (but I must admit that was a manual process... as in they did for a while,
and then some step of the test will fail, as not being able to locate
something).

I'm not too worried at this point with error recovery paths not covered by
tests; I have not even written all those paths to start with, and it is more
important to bring the whole system up and start running unit tests of the whole
thing.


> WDYT?
> 
> (Still working on review; responding to this now just because it's a top-level
> comment and thus easiest to respond to directly.)
> 
> > 
> > >  
> > > 
> > > That all correct?  
> > > 
> > > I feel as if both address and hash could benefit from a comment somewhere
> > > indicating that they mean different things in different parts of the code,
> > with
> > > an explanation as to what those differences are (if this is in the design
> doc,
> > > that's fine, and my apologies for forgetting it).
> > 
> > How the hash is used to locate entries on the table (and doubling of the
> table)
> > is covered on the design doc. But I don't fully follow what you're thinking.
> > 
> > thanks.

Randy Smith (Not in Mondays)

Next round of comments. Made it all the way through index_table.cc this time! :-} https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc ...

7 years ago (2013-12-02 21:48:05 UTC) #38

Next round of comments.  Made it all the way through index_table.cc this time! 
:-}

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:454: DCHECK_GE(extra_size, 0);
On 2013/11/26 00:32:41, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > Does this mean we always have to incrementally grow the extra table in
between
> > doublings of the main table?  Why put that restriction on it?
> 
> Nope. It can be zero.

Huh.  I guess I had a braino; dunno why I thought it was GT.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size
* sizeof(IndexBucket));
On 2013/11/26 00:32:41, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > It feels to me like there are assumptions in this code about aliasing, but
I'm
> > not certain what they are.   Let me go over what I believe the context for
> this
> > code is and make comments conditional on that.  
> > 
> > I believe that this function will be called in only three situations:
> > * First initialization.  main_table_ will be null, params->main_table will
> point
> > to memory mapped from a file, and params->extra_table will be null.
> 
> params->extra_table should not be null.

So what's the size of the extra_table on init?  Just the header?

> 
> > * Growth of extra table.  params->main_table will be NULL.  
> params->extra_table
> > != extra_table_, but will point to a memory mapped region of the same file,
> only
> > larger.  Thus any changes made to params->extra_table will be reflected in
> > extra_table_ and vice versa (for the length that extra_table_ is mapped
for). 
> > extra_table_ will be unmapped after this function returns (does the calling
> code
> > have it's own pointer to extra_table_?)
> 
> Yes, someone will know how to unmap the old table after this method returns.
> 
> > * Doubling of main table.  params->main_table != main_table_, but will point
> to
> > a memory mapped region of the same file, only doubled in size. 
> > params->extra_table will point to a section of memory mapped to the extra
> table
> > file, that hasn't changed in size any.  params->extra_table != extra_table_
> and
> > the location pointed to by extra_table_ will be unmapped after this routine
> > finishes (??), but the two memory regions will be direct aliases to one
> another.
> 
> sounds correct.
> 
> The underlying assumption is that the index is memory mapped and follows the
> format (and files) described in disk_format_v3.h, so the act of
re-initializing
> the IndexTable is a consequence of growing some file and creating another map
of
> the new extension. As such, we should never receive the same pointer that we
are
> currently using (but there's no check for that, because it shouldn't really
> matter), and we deal with pointers not mapped sections (in fact all the unit
> tests just allocate memory, never map anything).
> 
> Aliasing is important in that we don't memcpy data from the old mapping to the
> new mapping when the file just grows, as that would be a waste of time. So
yes,
> this has a memory interface, but expects the caller to deal with the files as
> defined by the file format, not just random pointers (hence
> TestCacheTables::CopyFrom simulating that behavior)
> 
> But that should be clear from the comment at the top of index_table.h, right?

Re-reading that comment, I find it more abstract that I'd like; I'd rather the
aliasing assumptions were made explicit, either in the comment to this function
or in the comments to IndexTableInitData.  But see later. 

> > (I'll probably want a comment clarifying these assumptions somewhere, but
I'll
> > wait to suggest what until I'm certain I understand it all. :-})
> > 
> > Assuming that that's correct, why refer to the same data through two
different
> > points (extra_table_ and params->extra_table)?  It seems like it's just
going
> to
> > generate confusion.  
> 
> Do I? I try to refer to the data that logically is being accessed... so for
> instance line 461 copies the old data somewhere safe and line 463 clears the
> storage for the new data. Production code could copy the new data and clear
the
> old data, but that would confuse the reader and prevent the unit tests from
> working :)
> 
> I mean, the whole thing requires mapping (or a judicious emulation from the
> caller), but if this code is misusing source or destination somewhere assuming
> that they are the same, let me know so that it's fixed.
> 
> So, conceptually, the caller could suspend all requests to this class, copy
the
> file to another file and map the new file (so that there is no aliasing)
before
> calling into this class again asking for re-init.

Ah!  Thank you.  That's a (hypothetical) use case that's very useful for me. 
Can we put in a comment (before this function or in IndexTableInitData) that if
this function is being called a second time, the caller of Init guarantees that
the data pointed to by params->main_table will be an improper (i.e. possibly
identical) super-set of what was previously passed, and that if main_table isn't
being grown, params->extra_table will similarly be a super-set of what was
previously passed?  

I'd also like to have the callers responsibility to deallocate the memory that
main_table_ and extra_table_ used to point to made explicit.  If one isn't
thinking about this function in terms of memory mapped files, it's somewhat
surprising; it effectively means that the caller owns that memory, which I don't
think is the default assumption one would have looking at the call.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
On 2013/11/26 00:32:41, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > What results in this 11?  This seems like you're simply saying that you
can't
> > have a table_len greater than 32 bits, in which case don't you want to just
> say
> > that?  (I.e. IIRTCC the 11 is dependent on a specific value of
kBaseTableLen,
> > and that dependency should be explicit in the DCHECK).
> 
> This is just saying that the table can be doubled 11 times from the smaller
> table  to the biggest table. Yes, the value (11) is derived from kBaseTableLen
> and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to
have
> the number of bits here (right when we are saying "how many extra bits can we
> have") rather to come out with a probably cryptic formula that derives the
> proper value from other numbers. The cost is of course less flexibility if the
> format change, but at least it will fail right when it is needed. When I was
> writing the code this value changed right here multiple times depending on
what
> I had implemented (and was ready to test) so it seemed better to keep it that
> way instead of tying it to other constants.

Huh.  I don't particularly care about future flexibility (I mean, future
flexibility is good, just not my highest current priority :-}).  But from my
perspective in reading the code, having the cryptic formulas calls out to the
reader what the dependencies between the different constants are.  I had been
thinking of asking if we could have constants defined in disk_format_v3.h for
the different pieces of the IndexCell, and then define the other constants that
need to match those based on those constants.  It means if you're tracing the
"Why is this number this value?" question, you get roadsigns.  

Can you say more about why having an 11 is clearer?  I'm not getting it.  It
doesn't feel like it tells the code reader/maintainer/debugger anything.

> I can define a local constant here if you don't like it.

I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what
you're offering; I don't find that any more useful than just using 11.  But I'd
be interested in your thoughts about my general argument above.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:475: small_table_ = extra_bits_ < kHashShift -
kHashSmallTableShift;
On 2013/11/26 00:32:41, rvargas wrote:
> On 2013/11/25 19:48:07, rdsmith wrote:
> > In other words, while the number of bits that are thrown away before making
> the
> > comparison between the hash and the IndexCell partial hash is greater than
the
> > number of bits used to index into the table, leave the table small, but when
> we
> > start using all those thrown-away bits and crowding into the comparison
> section,
> > switch to the large table?
> 
> I didn't intend to make a strong statement about what this means. It really
just
> means extra_bits_ < 6, as in the table can be doubled 6 times before reaching
> the 64k border between the two formats.
> 
> It's just that those two values are tightly related to each other, and the
only
> difference between them is the table type (small or large), and subtracting
them
> gives me the 6 that I need. Somewhat similar to the previous '11' but this
time
> I'm tying it to other constants that change accordingly (and as far as I
> remember I use this same formula in other places).
> 
> 
> > 
> > (I'll note that this seems like a different metric for the small/large table
> > distinction that I'd previously understood--I thought it was about whether
or
> > not the entries will always be stored in one file.  But looking at
> > disk_format_v3.h, it seems like the distinction is specified purely by the
> > number of entries.  If there's a subtle background connection between these
> > three things, it should probably be made explicit, or if one of them isn't
> > relevant, that should be made clear too.
> 
> The distinction between small and large is basically arbitrary. 64K is an
> interesting number, and it is the base table size of V2, so it was just
natural
> to keep using it. A small table is a new feature of V3, so it covers tables
> smaller than 64K.
> 
> The maximum number of blocks on a block file also happens to be 64k mostly
> because 64k is an interesting number (and it doesn't yield files that are too
> big with the largest block file), and it requires two standard pages (4k each)
> to hold the bitmap. 
> 
> That results in being able to store up to 64k entries on a single file (in
fact,
> slightly less than that). And that just fits in that it allows the implied
file
> feature of small tables, but it is not the cause of the border (although it
> would be hard to move that border somewhere else).
> 
> So things fit, but the "real" distinction between small and large tables is
> arbitrarily a 64k table. That's why some numbers are really "count how many
> times we can do foo", given the arbitrary limits of the format (start at 1k,
> switch at 64k, up to 22 bits)
> 
> > 
> > Based on how much pain it took me to figure out, I'll probably also want an
> > explanatory comment about what hashes mean (in the different contexts)
> > somewhere, but I'll ask for that once I'm confident I actually understand
how
> > they're used.)
> 
This makes sense, but mostly comes down to my wanting to be able to trace
dependencies between constants/distinctions.  I.e. if the defining difference
between a small and a large table is the 64K boundary, I'd love to have (e.g.) a
constant somewhere kMaxSmallTableSize = 64K, and have (e.g.) the difference
between kHashShift and kHashSmallTable shift defined in terms of it.  (I.e. this
is the same comment as the previous.)

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:485: // creating the bitmaps, clear the part of
the extra table.
On 2013/11/25 19:48:07, rdsmith wrote:
> nit, suggestion: "clear the part of the bitmap referring to the extra table"?

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:492: int main_table_bit_words = ((mask_ >> 1) +
1) * kCellsPerBucket / 32;
This looks to be a factor of two off from the size of the new main table
(buckets indexed from 0..mask_, total num buckets = mask_+1, mask_>>1+1 = 1/2
the total number of buckets, multiply by cells per bucket gives you half the
total number of cells), by which I infer it refers to the old main table.  Maybe
change the variable name to old_main_table_bit_words and/or get it from the
backup_header_?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:497: DCHECK(growing);
Just confirming that this is redundant with if (old_extra_table) because we
can't be doubling the table if this is the first time Init() is called.  (Not
objecting, just trying to understand the failure mode being guarded against.)

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words,
main_table_bit_words);
Any reason this couldn't be equal?  (I.e. if we're doubling without having any
extra table in between).

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:515: backup_bitmap_storage_.swap(storage);
What failure is being guarded against by copying and swapping instead of editing
in place?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
What's the context for these parameters?  They're only used on the first call to
Init(); are they defined on other calls?  And where do they come from on the
first call to Init()?  What do they contain?  It feels a bit weird to have this
function responsible for allocating storage on growing, but not on first
initialization.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:586: break;
nit, suggestion: do { } while(bucket_id); would communicate this loop's purpose
to me a bit more clearly.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:600: for (; !found;) {
This looks redundant with the "if (found) break;" below; I don't think you need
both.  (My nit/suggestion above about do/while would also apply here  if you
liked it above.)

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:669: if (state == ENTRY_FREE) {
Nit, suggestion: If this is a switch statement without a default: label, the
compiler will automatically warn if there isn't a label for each entry in the
enum.  I've found this a useful technique for making sure as code evolves that
all appropriate state conditionals are tested.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:839: kNumExtraBlocks / 4 : kNumExtraBlocks;
What is kNumExtraBlocks?  I didn't see any comment at the definition site, and
it's only used there and here.  Any rational for the /4 or =, or is that
arbitrary?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_;
nit: This seems like a strange place to do this assignment.  My assumption, when
I saw the call to MoveCells() in Init(), was that all it did was move the cells,
but this is a global state change.  To me, it would more naturally be paired
with the zeroing of extra_table; that's the point at which the extra_table
entries become unavailable to the normal lookup (to be made available again in
this routine as the cells are put in one at a time).

Also, it occurs to me reading this code: Where do we fix up the bucket chains
that are left dangling by folding the extra table into the main table on a
doubling?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:997: small_table_ = true;
Willing to put a comment in here somewhere about why the flipping back and forth
of small_table_ over the course of this algorithm?

rvargas (doing something else)

On 2013/12/02 19:28:01, rdsmith wrote: > On 2013/11/26 00:48:56, rvargas wrote: > > On 2013/11/25 ...

7 years ago (2013-12-03 00:41:32 UTC) #39

On 2013/12/02 19:28:01, rdsmith wrote:
> On 2013/11/26 00:48:56, rvargas wrote:
> > On 2013/11/25 19:48:07, rdsmith wrote:
> > > I feel as if both address and hash could benefit from a comment somewhere
> > > indicating that they mean different things in different parts of the code,
> > with
> > > an explanation as to what those differences are (if this is in the design
> doc,
> > > that's fine, and my apologies for forgetting it).
> > 
> > How the hash is used to locate entries on the table (and doubling of the
> table)
> > is covered on the design doc. But I don't fully follow what you're thinking.
> > 
> > thanks.
> 
> So the design document does cover what the hash is and how it's used to lookup
> entries (and thank you--it was useful for me to go back to the design doc.  I
> think I get more from it each time I go over it after reading the code, not
> surprisingly).  But my concern is more that the concepts "hash" and "address"
> mean different things at different places in the code, and I think it's easy
for
> the reader to assume that they mean the same thing, and be confused that way. 
> If they were consistently referred to by different names in the code, that
would
> address my concern.  I'm not sure if this should be a high priority--if the
> concepts are clear, it may be that a code reader will be able to track by
> context which meaning of hash/address is currently relevant.  But that's the
> concern I was raising. 
> 
> With regard to addresses, I think my confusion is between what I'll call "full
> addresses" (i.e. what Addr() passes around), and "address values" (what you
get
> from IndexCells and EntryCell::GetAddressValue()).  I think it would be
clearer
> if those things used different names consistently.  Does this make sense to
you?
>  I think that would mean:
>   * Changing the names of {Get,Set}Cell{SmallTable,}Address &
>     {FileNumber,StartBlock}FromAddress, along with the variables used
>     inside those functions?  
>   * I think based on its usage in IsValidAddress, kMaxAddress would also
>     become kMaxAddressValue.

Sure, as I said before, this can be improved by a different name but I have no
good ideas about what to call it.

So here's the whole picture:
disk_cache::Addr -> this is the regular address passed around most of the code
(I mean, this is visible to most files under disk_cache), and it is the general
address of anything (note that the type is not Address)

disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is
stored inside an Addr, and it is just a number of a fixed length. I mentioned
because it has a dedicated type name (and some code uses it, of course), and
because it "has" the nickname of "address value" as it is accessible as that
from any Addr.

And now the new neighbors:
The address of a Cell -> currently also known as "address", "defined" on the
file format as that, and with a variable bit length. Originally meant to be
basically the cell number, or cell index (but index is a very bad name). Should
be fairly private to this file (index_table.cc)

With the added twist that once you have an EntyCell, given that it is meant to
be used from the outside, it must provide a for of address, and so it translates
from cell number to Addr (so Address() returns Addr). But on the other hand, it
cannot hide the short version from other code _from this file_, as this file is
the one who is aware of the details of how many bits go where, so it provides
private methods to get to the "real" cell number.

As such, I would not want to go to AddressValue everywhere (even though today
that's the closest I have to be able to tell that it means a cell number),
because it is too close to Addr::value.

I would be happier if what is stored inside a Cell is a Fraby and so Address()
returns Addr and GetFraby returns Fraby. The caveat of course is that there is a
very simple translation from Fraby to CacheAddr.

How about Location for anything that relates to the cell number?

> 
> With regard to hash, similarly there's the portion of the hash that's stored
in
> IndexCell (call it "high hash"), the portion that's implied by the position of
> the IndexCell within the IndexTable ("low hash"), and the whole thing (gotten
> from base::Hash() and split up as needed, "full hash").  It looks to me as if
> what I'm calling the low hash is generally referred to as a bucket hash, so it
> may be the only real confusion is between high hash and full hash (for
example,
> the EntryCell constructor takes a full hash and passes a high hash to
> SetCellHash, storing the full hash in hash_).  I'm not sure what would be
> involved in clearing this up or if it's worth it, but I suspect that just
using
> different words for the high hash and the full hash (whatever they are) would
> reduce confusion in code readers.

Ah, now I see what you meant.

On the same lines, I could rename the thing that lives inside a cell as "id" as
in it is what identifies whatever is stored in that cell. It may be a little
weird how the id is derived from part of the hash (versus it _is_ part of the
hash), but the cell doesn't really care what is there (not that it cares about
any other field, but I digress)

uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) {
  // It is OK for the high order bits of lower_part to overlap with the stored
  // part of the hash.
  if (small_table_)
    return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part;

  return (GetCellId(cell) << kHashShift) | lower_part;
}

That would change bucket_hash to something else (and bucket_id, cell_id)...
maybe bucket_hash becomes bucket_id.

rvargas (doing something else)

Thanks again https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc#newcode463 net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size * sizeof(IndexBucket)); On 2013/12/02 ...

7 years ago (2013-12-04 01:04:16 UTC) #40

Thanks again

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size
* sizeof(IndexBucket));
On 2013/12/02 21:48:06, rdsmith wrote:
> On 2013/11/26 00:32:41, rvargas wrote:
> > On 2013/11/25 19:48:07, rdsmith wrote:
> > > It feels to me like there are assumptions in this code about aliasing, but
> I'm
> > > not certain what they are.   Let me go over what I believe the context for
> > this
> > > code is and make comments conditional on that.  
> > > 
> > > I believe that this function will be called in only three situations:
> > > * First initialization.  main_table_ will be null, params->main_table will
> > point
> > > to memory mapped from a file, and params->extra_table will be null.
> > 
> > params->extra_table should not be null.
> 
> So what's the size of the extra_table on init?  Just the header?

There is no header on the extra table, but it has to have some space there to be
able to store something. The reason is that no matter how big the main table is,
it may take as little as 5 entries to require space on the extra table. Backend
unit tests start with something like 8 cells, and production code starts with
1/8th of the main table.

> 
> > 
> > > * Growth of extra table.  params->main_table will be NULL.  
> > params->extra_table
> > > != extra_table_, but will point to a memory mapped region of the same
file,
> > only
> > > larger.  Thus any changes made to params->extra_table will be reflected in
> > > extra_table_ and vice versa (for the length that extra_table_ is mapped
> for). 
> > > extra_table_ will be unmapped after this function returns (does the
calling
> > code
> > > have it's own pointer to extra_table_?)
> > 
> > Yes, someone will know how to unmap the old table after this method returns.
> > 
> > > * Doubling of main table.  params->main_table != main_table_, but will
point
> > to
> > > a memory mapped region of the same file, only doubled in size. 
> > > params->extra_table will point to a section of memory mapped to the extra
> > table
> > > file, that hasn't changed in size any.  params->extra_table !=
extra_table_
> > and
> > > the location pointed to by extra_table_ will be unmapped after this
routine
> > > finishes (??), but the two memory regions will be direct aliases to one
> > another.
> > 
> > sounds correct.
> > 
> > The underlying assumption is that the index is memory mapped and follows the
> > format (and files) described in disk_format_v3.h, so the act of
> re-initializing
> > the IndexTable is a consequence of growing some file and creating another
map
> of
> > the new extension. As such, we should never receive the same pointer that we
> are
> > currently using (but there's no check for that, because it shouldn't really
> > matter), and we deal with pointers not mapped sections (in fact all the unit
> > tests just allocate memory, never map anything).
> > 
> > Aliasing is important in that we don't memcpy data from the old mapping to
the
> > new mapping when the file just grows, as that would be a waste of time. So
> yes,
> > this has a memory interface, but expects the caller to deal with the files
as
> > defined by the file format, not just random pointers (hence
> > TestCacheTables::CopyFrom simulating that behavior)
> > 
> > But that should be clear from the comment at the top of index_table.h,
right?
> 
> Re-reading that comment, I find it more abstract that I'd like; I'd rather the
> aliasing assumptions were made explicit, either in the comment to this
function
> or in the comments to IndexTableInitData.  But see later. 

I'm a little worried about this comment. The header mentions multiple files that
can be remapped while the cache is working and that management is external to
this file, so it looks pretty concrete to me. Without the files being mmapped,
the whole IndexTable would 
be completely impractical so that's not exactly a detail.

I'll extend the comment for Init()


> 
> > > (I'll probably want a comment clarifying these assumptions somewhere, but
> I'll
> > > wait to suggest what until I'm certain I understand it all. :-})
> > > 
> > > Assuming that that's correct, why refer to the same data through two
> different
> > > points (extra_table_ and params->extra_table)?  It seems like it's just
> going
> > to
> > > generate confusion.  
> > 
> > Do I? I try to refer to the data that logically is being accessed... so for
> > instance line 461 copies the old data somewhere safe and line 463 clears the
> > storage for the new data. Production code could copy the new data and clear
> the
> > old data, but that would confuse the reader and prevent the unit tests from
> > working :)
> > 
> > I mean, the whole thing requires mapping (or a judicious emulation from the
> > caller), but if this code is misusing source or destination somewhere
assuming
> > that they are the same, let me know so that it's fixed.
> > 
> > So, conceptually, the caller could suspend all requests to this class, copy
> the
> > file to another file and map the new file (so that there is no aliasing)
> before
> > calling into this class again asking for re-init.
> 
> Ah!  Thank you.  That's a (hypothetical) use case that's very useful for me. 
> Can we put in a comment (before this function or in IndexTableInitData) that
if
> this function is being called a second time, the caller of Init guarantees
that
> the data pointed to by params->main_table will be an improper (i.e. possibly
> identical) super-set of what was previously passed, and that if main_table
isn't
> being grown, params->extra_table will similarly be a super-set of what was
> previously passed?  
> 
> I'd also like to have the callers responsibility to deallocate the memory that
> main_table_ and extra_table_ used to point to made explicit.  If one isn't
> thinking about this function in terms of memory mapped files, it's somewhat
> surprising; it effectively means that the caller owns that memory, which I
don't
> think is the default assumption one would have looking at the call.
> 

The memory ownership should be clear by the definition of IndexTableInitData.
There is ownership transfer for the backups, but not for the bitmap or tables.
That's the (relatively new) coding standard so it feels a little weird if that
is repeated by comments.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
On 2013/12/02 21:48:06, rdsmith wrote:
> On 2013/11/26 00:32:41, rvargas wrote:
> > On 2013/11/25 19:48:07, rdsmith wrote:
> > > What results in this 11?  This seems like you're simply saying that you
> can't
> > > have a table_len greater than 32 bits, in which case don't you want to
just
> > say
> > > that?  (I.e. IIRTCC the 11 is dependent on a specific value of
> kBaseTableLen,
> > > and that dependency should be explicit in the DCHECK).
> > 
> > This is just saying that the table can be doubled 11 times from the smaller
> > table  to the biggest table. Yes, the value (11) is derived from
kBaseTableLen
> > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to
> have
> > the number of bits here (right when we are saying "how many extra bits can
we
> > have") rather to come out with a probably cryptic formula that derives the
> > proper value from other numbers. The cost is of course less flexibility if
the
> > format change, but at least it will fail right when it is needed. When I was
> > writing the code this value changed right here multiple times depending on
> what
> > I had implemented (and was ready to test) so it seemed better to keep it
that
> > way instead of tying it to other constants.
> 
> Huh.  I don't particularly care about future flexibility (I mean, future
> flexibility is good, just not my highest current priority :-}).  But from my
> perspective in reading the code, having the cryptic formulas calls out to the
> reader what the dependencies between the different constants are.  I had been
> thinking of asking if we could have constants defined in disk_format_v3.h for
> the different pieces of the IndexCell, and then define the other constants
that
> need to match those based on those constants.  It means if you're tracing the
> "Why is this number this value?" question, you get roadsigns.  
> 
> Can you say more about why having an 11 is clearer?  I'm not getting it.  It
> doesn't feel like it tells the code reader/maintainer/debugger anything.
>  
> > I can define a local constant here if you don't like it.
> 
> I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what
> you're offering; I don't find that any more useful than just using 11.  But
I'd
> be interested in your thoughts about my general argument above.
> 
> 

My best example would be the failure of the formula two lines below this one.

I can come up with something like the log of the largest table divided by the
smallest table, but then that rises the issue of deriving the largest table. I
hear what you are saying and I generally agree, to the point of having a bunch
of constants with meaningful names whose values can be derived easily by just
inspection (the reason for the constants here).

In this particular case, we have a line that calculates a number from the
comparison of the current table versus the smallest table. The first check says
we cannot have a negative number (a table smaller than the smallest), the second
line checks the value against an upper bound, and that should tell right away
that we are checking for the largest possible table (right?)

The only question is then why the largest table is 11 times larger than the
smallest table, or in other words, what is the size of the largest table.
Honestly, if I'm reading some code that I didn't write, I would not care much
about the value of that number (the actual limit for the test): I would mostly
care for some limit being there, either on a check, or in the code that uses
that number. In this case, it would mean to me the same if that value is 8 or
15, or derived from a formula. 8 could be wrong, the formula could be off by one
and in any case I would not spend time looking at the value unless I'm actually
trying to debug a problem and I have reason to suspect that the max value could
be at fault. And if I'm doing that, I would probably ignore the formula, figure
out what the value should be and compare that against what the code says (number
of bits on an address = 22, so 21 of them go to the main table; smaller table =
1024 (10 bits), so delta = 21 - 10)

And that's why I prefer a plain number here rather than a formula... given that
too much may be read from the formula (as in the check against 6). What I think
it matters is not the value, but the check itself (double check an upper limit,
whatever that is)

I hope I don't sound combative, that's not my intention. I just want to answer
the question before we setup in what to do.

I'm up for moving relevant constants to the file format header. That's why
kBaseTableLen is there. Shifts and masks I'm not so sure as they make more sense
with the code that uses them, and I don't want to make them visible to any code
outside this file.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:492: int main_table_bit_words = ((mask_ >> 1) +
1) * kCellsPerBucket / 32;
On 2013/12/02 21:48:06, rdsmith wrote:
> This looks to be a factor of two off from the size of the new main table
> (buckets indexed from 0..mask_, total num buckets = mask_+1, mask_>>1+1 = 1/2
> the total number of buckets, multiply by cells per bucket gives you half the
> total number of cells), by which I infer it refers to the old main table. 
Maybe
> change the variable name to old_main_table_bit_words and/or get it from the
> backup_header_?

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:497: DCHECK(growing);
On 2013/12/02 21:48:06, rdsmith wrote:
> Just confirming that this is redundant with if (old_extra_table) because we
> can't be doubling the table if this is the first time Init() is called.  (Not
> objecting, just trying to understand the failure mode being guarded against.)

correct. Protects against the first part of this function :)

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words,
main_table_bit_words);
On 2013/12/02 21:48:06, rdsmith wrote:
> Any reason this couldn't be equal?  (I.e. if we're doubling without having any
> extra table in between).

An extra table is always required.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:515: backup_bitmap_storage_.swap(storage);
On 2013/12/02 21:48:06, rdsmith wrote:
> What failure is being guarded against by copying and swapping instead of
editing
> in place?

The bitmap itself is supposed to be growing so we need a bigger backup

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
On 2013/12/02 21:48:06, rdsmith wrote:
> What's the context for these parameters?  They're only used on the first call
to
> Init(); are they defined on other calls?  And where do they come from on the
> first call to Init()?  What do they contain?  It feels a bit weird to have
this
> function responsible for allocating storage on growing, but not on first
> initialization.

There are dedicated files to store the backups. When the cache starts, those
files are read and the contents are provided to this method, so that we can
check the backup against the mmaped bitmap. When growing, there is no need to
read any file because the up to date version is already in memory, owned by this
object.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:586: break;
On 2013/12/02 21:48:06, rdsmith wrote:
> nit, suggestion: do { } while(bucket_id); would communicate this loop's
purpose
> to me a bit more clearly.

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:600: for (; !found;) {
On 2013/12/02 21:48:06, rdsmith wrote:
> This looks redundant with the "if (found) break;" below; I don't think you
need
> both.  (My nit/suggestion above about do/while would also apply here  if you
> liked it above.)

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:669: if (state == ENTRY_FREE) {
On 2013/12/02 21:48:06, rdsmith wrote:
> Nit, suggestion: If this is a switch statement without a default: label, the
> compiler will automatically warn if there isn't a label for each entry in the
> enum.  I've found this a useful technique for making sure as code evolves that
> all appropriate state conditionals are tested.

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:839: kNumExtraBlocks / 4 : kNumExtraBlocks;
On 2013/12/02 21:48:06, rdsmith wrote:
> What is kNumExtraBlocks?  I didn't see any comment at the definition site, and
> it's only used there and here.  Any rational for the /4 or =, or is that
> arbitrary?

Completely arbitrary (well, /4 is tied to *2). kNumExtraBlocks is just how many
extra blocks should be allocated each time the extra table grows. So this says,
for tables larger than 2048, add 256 bloacks each time. Otherwise, add 1024
blocks.

So this means the tables total size goes 1024 - 1024+256 - 1024+512 - 1024+768 -
2048 - 3072 - 4096... or something like that (because extra doesn't really start
at 0)

btw, kNumExtraBlocks also directs how slowly each block file grows.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_;
On 2013/12/02 21:48:06, rdsmith wrote:
> nit: This seems like a strange place to do this assignment.  My assumption,
when
> I saw the call to MoveCells() in Init(), was that all it did was move the
cells,
> but this is a global state change.  To me, it would more naturally be paired
> with the zeroing of extra_table; that's the point at which the extra_table
> entries become unavailable to the normal lookup (to be made available again in
> this routine as the cells are put in one at a time).

Sort of. The state of this object is in flux while Init is running, and this
method only runs inside Init, so is not like the internal state should be
consistent when this method starts. This method receives a pointer to the extra
table because the core of Init is switching tables and bitmaps and clearing
buffers for this method. But it seems a little silly to clear a member in Init
only to pass its value as an explicit argument somewhere else that would be in
charge of setting the final value for the member. I mean, the final value of
max_bucket is only stable when MoveCells completes so it seemed better to
delegate all actions related to that field to this method. (as always, let me
know if you feel strongly about doing that on Init)


> 
> Also, it occurs to me reading this code: Where do we fix up the bucket chains
> that are left dangling by folding the extra table into the main table on a
> doubling?

I don't follow the question. We take a chain one at a time and follow it to the
end, moving some entries from the main table and all entries from the extra
table (to the new extra table)

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:997: small_table_ = true;
On 2013/12/02 21:48:06, rdsmith wrote:
> Willing to put a comment in here somewhere about why the flipping back and
forth
> of small_table_ over the course of this algorithm?

yeah, not so happy about having to do that. Added a comment.

Randy Smith (Not in Mondays)

On 2013/12/02 21:01:49, rvargas wrote: > On 2013/12/02 18:53:42, rdsmith wrote: > > On 2013/11/26 ...

7 years ago (2013-12-04 20:59:34 UTC) #41

On 2013/12/02 21:01:49, rvargas wrote:
> On 2013/12/02 18:53:42, rdsmith wrote:
> > On 2013/11/26 00:48:56, rvargas wrote:
> > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > Part-way through reviewing Init(), and getting somewhat bogged down in
> > > > understanding the use of hash values, so pausing to provide these
comments
> > and
> > > > ask for clarifications on hash.  I believe that:
> > > > * The hash for a particular key is 32 bits, generated by base::Hash(key)
> > > > * The bottom bits (a variable number of bits, depending on
> table_len/mask_)
> > > are
> > > > used as the address of the relevant bucket for that entry.
> > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the
IndexCell
> > and
> > > > compared on cell lookup.  That shift means that we're comparing just the
> > > number
> > > > of bits stored in the table for each size; i.e. kHash{SmallTable,}Shift
is
> > > > dependent on the number of bits stored for the hash in the different
types
> > of
> > > > table/cells.
> > > > * The top and bottom bits in this scheme may overlap.
> > > > * When we increase the table size via doubling (i.e. increase the number
> of
> > > > buckets), that may mean that the extra bit (or bits, if we haven't
touched
> a
> > > > cell from previous doublings) used in indexing into the table may be
> wrong,
> > > and
> > > > the cell may need to be moved.  This is what the MisplacedCell() logic
is
> > > about.
> > > 
> > > Roughly yes. As the table grows, bits move from the "we need to store this
> > part"
> > > to the "this part is implied given the position in the table". There is no
> > need
> > > to store store less and less bits each time (and complicate all the logic)
> so
> > > that means that each time we double the size, there's another bit that
> becomes
> > > redundant.
> > > 
> > > When we double the table, we move all cells that have to moved to the
upper
> > half
> > > of the table (about half the cells should move). If that goes well, there
> will
> > > never be a misplaced cell. However, if there is a crash while we are
moving
> > > cells (even though this is a memory-only operation), we could end up with
> some
> > > cells on the right place and some cells in the wrong place... those
> misplaced
> > > cells are detected by a mismatch on the redundant bits, and can be moved
as
> we
> > > see them.
> > 
> > That makes sense.  Where is the logic that moves the cells on table
doubling? 
> > Scanning Init(), I only see a call to MoveCells for the old_extra_table
cells.
>  
> 
> At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes
> through a loop and calls MoveSingleCell() when needed... note that the
function
> goes through main_table_, even though it needs the caller to provide a pointer
> to the old_extra_table.

Right--I had incorrectly thought that MoveCells() was only about the extra
table.  Thanks for the pointer.

> 
> 
> > 
> > Two flags get raised in my brain around crash handling (== unlikely but not
> > impossible corner case) code.  The first is that it needs to be tested
> > thoroughly, because our chances of noticing problems in it from more general
> > testing and the field is lower.  Quickly skimming the unit test, I'm not
sure
> I
> > see anything that will result in exercising the
> > MisplacedHash/HandleMisplacedHash logic; am I missing it?   
> > 
> > The second is that code to handle rare corner cases can sometimes result in
> > masking other bugs.  It'd be nice if there was some way to disable the
> misplaced
> > hash logic for the tests that aren't trying to emulate the crashing code.   
> > 
> 
> The tests are not going through inconsistent states or checks at this point...
> some of the checks performed will crash the test if they fail (because most
> tests run without a backend)... and the tests themselves are not generating
bad
> data (but I must admit that was a manual process... as in they did for a
while,
> and then some step of the test will fail, as not being able to locate
> something).
> 
> I'm not too worried at this point with error recovery paths not covered by
> tests; I have not even written all those paths to start with, and it is more
> important to bring the whole system up and start running unit tests of the
whole
> thing.

I'm happy to go with postponing corner case testing until later in the process,
but I would sorta like to keep to the invariant that code has tests associated
with it when it's committed.  Does that make sense to you?  Any easy way to add
tests for the misplaced cell logic?

> 
> 
> > WDYT?
> > 
> > (Still working on review; responding to this now just because it's a
top-level
> > comment and thus easiest to respond to directly.)
> > 
> > > 
> > > >  
> > > > 
> > > > That all correct?  
> > > > 
> > > > I feel as if both address and hash could benefit from a comment
somewhere
> > > > indicating that they mean different things in different parts of the
code,
> > > with
> > > > an explanation as to what those differences are (if this is in the
design
> > doc,
> > > > that's fine, and my apologies for forgetting it).
> > > 
> > > How the hash is used to locate entries on the table (and doubling of the
> > table)
> > > is covered on the design doc. But I don't fully follow what you're
thinking.
> > > 
> > > thanks.

Randy Smith (Not in Mondays)

Responses to your comments; continuing my review. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc#newcode463 net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, ...

7 years ago (2013-12-05 19:17:47 UTC) #42

Responses to your comments; continuing my review.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:463: memset(params->extra_table, 0, extra_size
* sizeof(IndexBucket));
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > On 2013/11/26 00:32:41, rvargas wrote:
> > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > It feels to me like there are assumptions in this code about aliasing,
but
> > I'm
> > > > not certain what they are.   Let me go over what I believe the context
for
> > > this
> > > > code is and make comments conditional on that.  
> > > > 
> > > > I believe that this function will be called in only three situations:
> > > > * First initialization.  main_table_ will be null, params->main_table
will
> > > point
> > > > to memory mapped from a file, and params->extra_table will be null.
> > > 
> > > params->extra_table should not be null.
> > 
> > So what's the size of the extra_table on init?  Just the header?
> 
> There is no header on the extra table, but it has to have some space there to
be
> able to store something. The reason is that no matter how big the main table
is,
> it may take as little as 5 entries to require space on the extra table.
Backend
> unit tests start with something like 8 cells, and production code starts with
> 1/8th of the main table.
> 
> > 
> > > 
> > > > * Growth of extra table.  params->main_table will be NULL.  
> > > params->extra_table
> > > > != extra_table_, but will point to a memory mapped region of the same
> file,
> > > only
> > > > larger.  Thus any changes made to params->extra_table will be reflected
in
> > > > extra_table_ and vice versa (for the length that extra_table_ is mapped
> > for). 
> > > > extra_table_ will be unmapped after this function returns (does the
> calling
> > > code
> > > > have it's own pointer to extra_table_?)
> > > 
> > > Yes, someone will know how to unmap the old table after this method
returns.
> > > 
> > > > * Doubling of main table.  params->main_table != main_table_, but will
> point
> > > to
> > > > a memory mapped region of the same file, only doubled in size. 
> > > > params->extra_table will point to a section of memory mapped to the
extra
> > > table
> > > > file, that hasn't changed in size any.  params->extra_table !=
> extra_table_
> > > and
> > > > the location pointed to by extra_table_ will be unmapped after this
> routine
> > > > finishes (??), but the two memory regions will be direct aliases to one
> > > another.
> > > 
> > > sounds correct.
> > > 
> > > The underlying assumption is that the index is memory mapped and follows
the
> > > format (and files) described in disk_format_v3.h, so the act of
> > re-initializing
> > > the IndexTable is a consequence of growing some file and creating another
> map
> > of
> > > the new extension. As such, we should never receive the same pointer that
we
> > are
> > > currently using (but there's no check for that, because it shouldn't
really
> > > matter), and we deal with pointers not mapped sections (in fact all the
unit
> > > tests just allocate memory, never map anything).
> > > 
> > > Aliasing is important in that we don't memcpy data from the old mapping to
> the
> > > new mapping when the file just grows, as that would be a waste of time. So
> > yes,
> > > this has a memory interface, but expects the caller to deal with the files
> as
> > > defined by the file format, not just random pointers (hence
> > > TestCacheTables::CopyFrom simulating that behavior)
> > > 
> > > But that should be clear from the comment at the top of index_table.h,
> right?
> > 
> > Re-reading that comment, I find it more abstract that I'd like; I'd rather
the
> > aliasing assumptions were made explicit, either in the comment to this
> function
> > or in the comments to IndexTableInitData.  But see later. 
> 
> I'm a little worried about this comment. The header mentions multiple files
that
> can be remapped while the cache is working and that management is external to
> this file, so it looks pretty concrete to me. Without the files being mmapped,
> the whole IndexTable would 
> be completely impractical so that's not exactly a detail.
> 
> I'll extend the comment for Init()

I'm good with the IndexTable code explicitly relying on the memory mapping.  My
preference would be that that memory mapping and the aliasing it implies was
called out actually in a comment near the code, but if you'd prefer a pointer to
the design doc I'm ok with that.

> 
> 
> > 
> > > > (I'll probably want a comment clarifying these assumptions somewhere,
but
> > I'll
> > > > wait to suggest what until I'm certain I understand it all. :-})
> > > > 
> > > > Assuming that that's correct, why refer to the same data through two
> > different
> > > > points (extra_table_ and params->extra_table)?  It seems like it's just
> > going
> > > to
> > > > generate confusion.  
> > > 
> > > Do I? I try to refer to the data that logically is being accessed... so
for
> > > instance line 461 copies the old data somewhere safe and line 463 clears
the
> > > storage for the new data. Production code could copy the new data and
clear
> > the
> > > old data, but that would confuse the reader and prevent the unit tests
from
> > > working :)
> > > 
> > > I mean, the whole thing requires mapping (or a judicious emulation from
the
> > > caller), but if this code is misusing source or destination somewhere
> assuming
> > > that they are the same, let me know so that it's fixed.
> > > 
> > > So, conceptually, the caller could suspend all requests to this class,
copy
> > the
> > > file to another file and map the new file (so that there is no aliasing)
> > before
> > > calling into this class again asking for re-init.
> > 
> > Ah!  Thank you.  That's a (hypothetical) use case that's very useful for me.

> > Can we put in a comment (before this function or in IndexTableInitData) that
> if
> > this function is being called a second time, the caller of Init guarantees
> that
> > the data pointed to by params->main_table will be an improper (i.e. possibly
> > identical) super-set of what was previously passed, and that if main_table
> isn't
> > being grown, params->extra_table will similarly be a super-set of what was
> > previously passed?  
> > 
> > I'd also like to have the callers responsibility to deallocate the memory
that
> > main_table_ and extra_table_ used to point to made explicit.  If one isn't
> > thinking about this function in terms of memory mapped files, it's somewhat
> > surprising; it effectively means that the caller owns that memory, which I
> don't
> > think is the default assumption one would have looking at the call.
> > 
> 
> The memory ownership should be clear by the definition of IndexTableInitData.
> There is ownership transfer for the backups, but not for the bitmap or tables.
> That's the (relatively new) coding standard so it feels a little weird if that
> is repeated by comments.

Ok, that makes sense.
>

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > On 2013/11/26 00:32:41, rvargas wrote:
> > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > What results in this 11?  This seems like you're simply saying that you
> > can't
> > > > have a table_len greater than 32 bits, in which case don't you want to
> just
> > > say
> > > > that?  (I.e. IIRTCC the 11 is dependent on a specific value of
> > kBaseTableLen,
> > > > and that dependency should be explicit in the DCHECK).
> > > 
> > > This is just saying that the table can be doubled 11 times from the
smaller
> > > table  to the biggest table. Yes, the value (11) is derived from
> kBaseTableLen
> > > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer to
> > have
> > > the number of bits here (right when we are saying "how many extra bits can
> we
> > > have") rather to come out with a probably cryptic formula that derives the
> > > proper value from other numbers. The cost is of course less flexibility if
> the
> > > format change, but at least it will fail right when it is needed. When I
was
> > > writing the code this value changed right here multiple times depending on
> > what
> > > I had implemented (and was ready to test) so it seemed better to keep it
> that
> > > way instead of tying it to other constants.
> > 
> > Huh.  I don't particularly care about future flexibility (I mean, future
> > flexibility is good, just not my highest current priority :-}).  But from my
> > perspective in reading the code, having the cryptic formulas calls out to
the
> > reader what the dependencies between the different constants are.  I had
been
> > thinking of asking if we could have constants defined in disk_format_v3.h
for
> > the different pieces of the IndexCell, and then define the other constants
> that
> > need to match those based on those constants.  It means if you're tracing
the
> > "Why is this number this value?" question, you get roadsigns.  
> > 
> > Can you say more about why having an 11 is clearer?  I'm not getting it.  It
> > doesn't feel like it tells the code reader/maintainer/debugger anything.
> >  
> > > I can define a local constant here if you don't like it.
> > 
> > I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what
> > you're offering; I don't find that any more useful than just using 11.  But
> I'd
> > be interested in your thoughts about my general argument above.
> > 
> > 
> 
> My best example would be the failure of the formula two lines below this one.
> 
> I can come up with something like the log of the largest table divided by the
> smallest table, but then that rises the issue of deriving the largest table. I
> hear what you are saying and I generally agree, to the point of having a bunch
> of constants with meaningful names whose values can be derived easily by just
> inspection (the reason for the constants here).
> 
> In this particular case, we have a line that calculates a number from the
> comparison of the current table versus the smallest table. The first check
says
> we cannot have a negative number (a table smaller than the smallest), the
second
> line checks the value against an upper bound, and that should tell right away
> that we are checking for the largest possible table (right?)
> 
> The only question is then why the largest table is 11 times larger than the
> smallest table, or in other words, what is the size of the largest table.
> Honestly, if I'm reading some code that I didn't write, I would not care much
> about the value of that number (the actual limit for the test): I would mostly
> care for some limit being there, either on a check, or in the code that uses
> that number. In this case, it would mean to me the same if that value is 8 or
> 15, or derived from a formula. 8 could be wrong, the formula could be off by
one
> and in any case I would not spend time looking at the value unless I'm
actually
> trying to debug a problem and I have reason to suspect that the max value
could
> be at fault. And if I'm doing that, I would probably ignore the formula,
figure
> out what the value should be and compare that against what the code says
(number
> of bits on an address = 22, so 21 of them go to the main table; smaller table
=
> 1024 (10 bits), so delta = 21 - 10)
> 
> And that's why I prefer a plain number here rather than a formula... given
that
> too much may be read from the formula (as in the check against 6). What I
think
> it matters is not the value, but the check itself (double check an upper
limit,
> whatever that is)
> 
> I hope I don't sound combative, that's not my intention. I just want to answer
> the question before we setup in what to do.

Nope, you don't sound combative, and I appreciate the detailed explanation.

> 
> I'm up for moving relevant constants to the file format header. That's why
> kBaseTableLen is there. Shifts and masks I'm not so sure as they make more
sense
> with the code that uses them, and I don't want to make them visible to any
code
> outside this file.

I think most of my feeling on this topic would be addressed by making explicit
the relationships between the different constants--I don't have strong general
feelings about either where the constants are declared or this particular
location being a number rather than a constant 11.  But I would like constants
that are used in multiple places in the code and do have a relationship to each
other to be expressed in terms of each other.  So for instance, disk_format_v3.h
would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and
kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from
those values (probably in index_table.cc).  Would you be good with that?

Having said all that, the place I diverge from you in modeling this code being
read by someone who doesn't know it is, if they suspected the max value would be
at fault, the formula would give them a hint as to how that max was derived. 
That hint might be wrong, but it's structure that would lead them further into
the code, and point at the logic that they would need to vet.  But as I say, I
more care about the relationship between the defined constants being clear than
about this particular number being a formula.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words,
main_table_bit_words);
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > Any reason this couldn't be equal?  (I.e. if we're doubling without having
any
> > extra table in between).
> 
> An extra table is always required.

Reasonable to have a DCHECK that we're getting an extra table, maybe with a
quick comment that we always need a little bit of space in the extra table?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > What's the context for these parameters?  They're only used on the first
call
> to
> > Init(); are they defined on other calls?  And where do they come from on the
> > first call to Init()?  What do they contain?  It feels a bit weird to have
> this
> > function responsible for allocating storage on growing, but not on first
> > initialization.
> 
> There are dedicated files to store the backups. When the cache starts, those
> files are read and the contents are provided to this method, so that we can
> check the backup against the mmaped bitmap. When growing, there is no need to
> read any file because the up to date version is already in memory, owned by
this
> object.

So that means at some point in the future there will be code checking the backup
bitmaps against the current bitmaps on first load?

Also, what's the allocation behavior for params->backup_header and backup_bitmap
on calls to Init() after the first?  There's not a code correctness issue, but
it's sorta a shame if they keep getting allocated and then deleted without being
examined.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_;
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > nit: This seems like a strange place to do this assignment.  My assumption,
> when
> > I saw the call to MoveCells() in Init(), was that all it did was move the
> cells,
> > but this is a global state change.  To me, it would more naturally be paired
> > with the zeroing of extra_table; that's the point at which the extra_table
> > entries become unavailable to the normal lookup (to be made available again
in
> > this routine as the cells are put in one at a time).
> 
> Sort of. The state of this object is in flux while Init is running, and this
> method only runs inside Init, so is not like the internal state should be
> consistent when this method starts. 

Fair, though also tricky to realize for a code reader, since MoveCells() is
neither next to Init() nor commented to be only a subroutine of Init(). 
Explanatory comment?

> This method receives a pointer to the extra
> table because the core of Init is switching tables and bitmaps and clearing
> buffers for this method. But it seems a little silly to clear a member in Init
> only to pass its value as an explicit argument somewhere else that would be in
> charge of setting the final value for the member. 

So to be clear, the image in my mind was of setting header()->max_bucket =
mask_; in Init() after the call to MoveCells().  I don't see any dependencies on
header()->max_bucket in this routine, so I don't think any explicit arguments
would be required.  It might be worth a comment in Init() that "MoveCells()
depends on the pre-doubling values of header()->max_bucket", but what's really
true is that MoveCells() needs to be understood in the context of Init(); there
isn't a good interface contract between the two routines.


> I mean, the final value of
> max_bucket is only stable when MoveCells completes so it seemed better to
> delegate all actions related to that field to this method. (as always, let me
> know if you feel strongly about doing that on Init)

I don't care strongly; I think at base what I'm reacting to is that MoveCells()
doesn't make sense without knowing the particular point in Init() when it's
called (i.e. it's not particularly easy to define an interface contract).   But
I don't see a way to fix that; I do think pulling it out as a subroutine makes
sense just to keep Init() from being a very long function, but it only sorta
helps on the abstraction level.

> > Also, it occurs to me reading this code: Where do we fix up the bucket
chains
> > that are left dangling by folding the extra table into the main table on a
> > doubling?
> 
> I don't follow the question. We take a chain one at a time and follow it to
the
> end, moving some entries from the main table and all entries from the extra
> table (to the new extra table)

I was wondering when the value of bucket->next was updated for the buckets in
the main table.  It looks to me as if the only two places where the next field
is set are in CreateEntryCell() and in GetNextBucket().  Naively, that would
seem to imply to me that after moving all the cells from the extra table to the
main table in MoveCells(), we've got pointers into invalid parts of the extra
table still dangling from the main table buckets.  If GetNextBucket() is called
on those buckets before any buckets are added to the extra table, that function
will fix/zero that pointer.  But if we add an extra bucket to the extra table
before calling GetNextBucket() on the main table bucket that used to point to
that extra table, we'll have an incorrect pointer into that bucket.  Am I
confused?

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:997: small_table_ = true;
On 2013/12/04 01:04:17, rvargas wrote:
> On 2013/12/02 21:48:06, rdsmith wrote:
> > Willing to put a comment in here somewhere about why the flipping back and
> forth
> > of small_table_ over the course of this algorithm?
> 
> yeah, not so happy about having to do that. Added a comment.

I'm not sure I like this suggestion, but: You could make the small/large table
distinction be based on the small_table_ variable in EntryCell, and pass that
bit directly to CreateEntryCell.  The minus of this, of course, is that it loses
an assertion opportunity that EntryCells and the table they are being saved into
have the same value.  You could (again not sure I like the suggestion, just
thinking it through) have small_table_ in IndexTable be a tri-state, with a
"transitioning" value, and only do the assertion when you aren't transitioning
(or assert as if its a large table on writing, but treat it like a small table
on reading, though I'm not 100% sure that would work).  Just a thought; up to
you what, if anything, you do with it.

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:687: break;
This pattern doesn't have much virtue over if conditionals if it has a default
branch; the advantage (at least to me) of using a switch statement is that if
you don't have a default and the enum changes, the compiler warns you that you
have to update the switch statement.  

(Still completely a suggestion; just sharing with you a coding pattern I've
found useful in the past.)

rvargas (doing something else)

This time there is no code update. I'm still considering what to do with the ...

7 years ago (2013-12-07 02:13:44 UTC) #43

This time there is no code update. I'm still considering what to do with the
mismatched entry path.

... and I'll be out next week.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:499: DCHECK_GT(old_num_words,
main_table_bit_words);
On 2013/12/05 19:17:48, rdsmith wrote:
> On 2013/12/04 01:04:17, rvargas wrote:
> > On 2013/12/02 21:48:06, rdsmith wrote:
> > > Any reason this couldn't be equal?  (I.e. if we're doubling without having
> any
> > > extra table in between).
> > 
> > An extra table is always required.
> 
> Reasonable to have a DCHECK that we're getting an extra table, maybe with a
> quick comment that we always need a little bit of space in the extra table?

Done.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
On 2013/12/05 19:17:48, rdsmith wrote:
> On 2013/12/04 01:04:17, rvargas wrote:
> > On 2013/12/02 21:48:06, rdsmith wrote:
> > > What's the context for these parameters?  They're only used on the first
> call
> > to
> > > Init(); are they defined on other calls?  And where do they come from on
the
> > > first call to Init()?  What do they contain?  It feels a bit weird to have
> > this
> > > function responsible for allocating storage on growing, but not on first
> > > initialization.
> > 
> > There are dedicated files to store the backups. When the cache starts, those
> > files are read and the contents are provided to this method, so that we can
> > check the backup against the mmaped bitmap. When growing, there is no need
to
> > read any file because the up to date version is already in memory, owned by
> this
> > object.
> 
> So that means at some point in the future there will be code checking the
backup
> bitmaps against the current bitmaps on first load?

Checking against the backup is already in place. It is performed on demand by
CheckState()


> Also, what's the allocation behavior for params->backup_header and
backup_bitmap
> on calls to Init() after the first?  There's not a code correctness issue, but
> it's sorta a shame if they keep getting allocated and then deleted without
being
> examined.

The caller doesn't pass them. That would require reading the files, and that
doesn't happen for reinit.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_;
On 2013/12/05 19:17:48, rdsmith wrote:
> On 2013/12/04 01:04:17, rvargas wrote:
> > On 2013/12/02 21:48:06, rdsmith wrote:
> > > nit: This seems like a strange place to do this assignment.  My
assumption,
> > when
> > > I saw the call to MoveCells() in Init(), was that all it did was move the
> > cells,
> > > but this is a global state change.  To me, it would more naturally be
paired
> > > with the zeroing of extra_table; that's the point at which the extra_table
> > > entries become unavailable to the normal lookup (to be made available
again
> in
> > > this routine as the cells are put in one at a time).
> > 
> > Sort of. The state of this object is in flux while Init is running, and this
> > method only runs inside Init, so is not like the internal state should be
> > consistent when this method starts. 
> 
> Fair, though also tricky to realize for a code reader, since MoveCells() is
> neither next to Init() nor commented to be only a subroutine of Init(). 
> Explanatory comment?
> 

done

> > This method receives a pointer to the extra
> > table because the core of Init is switching tables and bitmaps and clearing
> > buffers for this method. But it seems a little silly to clear a member in
Init
> > only to pass its value as an explicit argument somewhere else that would be
in
> > charge of setting the final value for the member. 
> 
> So to be clear, the image in my mind was of setting header()->max_bucket =
> mask_; in Init() after the call to MoveCells().  I don't see any dependencies
on

If I understand correctly, that would fail. The act of writing new cells into
the tables (and that happens when old cells are moved), changes max_bucket, as
that is just the last used cell from the extra table (or more exactly, the first
unused bucket). A write to the main table is fine, but any write to the extra
table may move max_bucket forward.

> header()->max_bucket in this routine, so I don't think any explicit arguments
> would be required.  It might be worth a comment in Init() that "MoveCells()
> depends on the pre-doubling values of header()->max_bucket", but what's really
> true is that MoveCells() needs to be understood in the context of Init();
there
> isn't a good interface contract between the two routines.

yeah, I added a comment on MoveCells; let me know if that's not enough.

> 
> 
> > I mean, the final value of
> > max_bucket is only stable when MoveCells completes so it seemed better to
> > delegate all actions related to that field to this method. (as always, let
me
> > know if you feel strongly about doing that on Init)
> 
> I don't care strongly; I think at base what I'm reacting to is that
MoveCells()
> doesn't make sense without knowing the particular point in Init() when it's
> called (i.e. it's not particularly easy to define an interface contract).  
But
> I don't see a way to fix that; I do think pulling it out as a subroutine makes
> sense just to keep Init() from being a very long function, but it only sorta
> helps on the abstraction level.

Exactly. It is more a "now we need to sit down and move stuff, but we are not
really done with init"

> 
> > > Also, it occurs to me reading this code: Where do we fix up the bucket
> chains
> > > that are left dangling by folding the extra table into the main table on a
> > > doubling?
> > 
> > I don't follow the question. We take a chain one at a time and follow it to
> the
> > end, moving some entries from the main table and all entries from the extra
> > table (to the new extra table)
> 
> I was wondering when the value of bucket->next was updated for the buckets in
> the main table.  It looks to me as if the only two places where the next field
> is set are in CreateEntryCell() and in GetNextBucket().  Naively, that would
> seem to imply to me that after moving all the cells from the extra table to
the
> main table in MoveCells(), we've got pointers into invalid parts of the extra
> table still dangling from the main table buckets.  If GetNextBucket() is
called
> on those buckets before any buckets are added to the extra table, that
function
> will fix/zero that pointer.  But if we add an extra bucket to the extra table
> before calling GetNextBucket() on the main table bucket that used to point to
> that extra table, we'll have an incorrect pointer into that bucket.  Am I
> confused?

That sounds correct (seems like there's a bug there).

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:997: small_table_ = true;
On 2013/12/05 19:17:48, rdsmith wrote:
> On 2013/12/04 01:04:17, rvargas wrote:
> > On 2013/12/02 21:48:06, rdsmith wrote:
> > > Willing to put a comment in here somewhere about why the flipping back and
> > forth
> > > of small_table_ over the course of this algorithm?
> > 
> > yeah, not so happy about having to do that. Added a comment.
> 
> I'm not sure I like this suggestion, but: You could make the small/large table
> distinction be based on the small_table_ variable in EntryCell, and pass that
> bit directly to CreateEntryCell.  The minus of this, of course, is that it
loses
> an assertion opportunity that EntryCells and the table they are being saved
into
> have the same value.  You could (again not sure I like the suggestion, just
> thinking it through) have small_table_ in IndexTable be a tri-state, with a
> "transitioning" value, and only do the assertion when you aren't transitioning
> (or assert as if its a large table on writing, but treat it like a small table
> on reading, though I'm not 100% sure that would work).  Just a thought; up to
> you what, if anything, you do with it.

thx

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:471: // extra_bits_ is really measured against
table-size specific values.
> So for instance, disk_format_v3.h would have k{SmallTable,}HashOffset
> and k{SmallTable,}HashLength, and kHash{SmallTable,}Shift and
> kCell{SmallTable,}HashMask would be defined from those values
>(probably in index_table.cc).  Would you be good with that?

They look the same to me (as in there is no length now). Right now there is
Offset and Mask; introducing another one to derive one of them from the other
doesn't sound right to me, when all values are derived from  visual inspection
of the bit patterns on the comments of the file format.

If you really want to make them visible in file_format.h I'll do that, but I've
come to regret that the equivalent masks are visible on addr.h, when there is
exactly one compilation unit that needs them.

If what you want is point the reader in the right direction, I can comment out a
formula here. Note that the "real" formula requires logarithms, so it's not like
I can have a real compile time constant that way.

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:687: break;
On 2013/12/05 19:17:48, rdsmith wrote:
> This pattern doesn't have much virtue over if conditionals if it has a default
> branch; the advantage (at least to me) of using a switch statement is that if
> you don't have a default and the enum changes, the compiler warns you that you
> have to update the switch statement.  
> 
> (Still completely a suggestion; just sharing with you a coding pattern I've
> found useful in the past.)

Yes, I understand your point. But the switch doesn't cover all cases so between
extending it to cover all cases with nothing and a default I prefer having a
default (in general), even if it "breaks" if the enum changes. On the other
hand, if changing one line makes you happy I'm happy.

rvargas (doing something else)

New version up, ptal. https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index_table.cc#newcode922 net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_; On 2013/12/07 ...

7 years ago (2013-12-19 23:51:24 UTC) #44

New version up, ptal.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:922: header()->max_bucket = mask_;
On 2013/12/07 02:13:45, rvargas wrote:
> On 2013/12/05 19:17:48, rdsmith wrote:
> > On 2013/12/04 01:04:17, rvargas wrote:
> > > On 2013/12/02 21:48:06, rdsmith wrote:
> > > > nit: This seems like a strange place to do this assignment.  My
> assumption,
> > > when
> > > > I saw the call to MoveCells() in Init(), was that all it did was move
the
> > > cells,
> > > > but this is a global state change.  To me, it would more naturally be
> paired
> > > > with the zeroing of extra_table; that's the point at which the
extra_table
> > > > entries become unavailable to the normal lookup (to be made available
> again
> > in
> > > > this routine as the cells are put in one at a time).
> > > 
> > > Sort of. The state of this object is in flux while Init is running, and
this
> > > method only runs inside Init, so is not like the internal state should be
> > > consistent when this method starts. 
> > 
> > Fair, though also tricky to realize for a code reader, since MoveCells() is
> > neither next to Init() nor commented to be only a subroutine of Init(). 
> > Explanatory comment?
> > 
> 
> done
> 
> > > This method receives a pointer to the extra
> > > table because the core of Init is switching tables and bitmaps and
clearing
> > > buffers for this method. But it seems a little silly to clear a member in
> Init
> > > only to pass its value as an explicit argument somewhere else that would
be
> in
> > > charge of setting the final value for the member. 
> > 
> > So to be clear, the image in my mind was of setting header()->max_bucket =
> > mask_; in Init() after the call to MoveCells().  I don't see any
dependencies
> on
> 
> If I understand correctly, that would fail. The act of writing new cells into
> the tables (and that happens when old cells are moved), changes max_bucket, as
> that is just the last used cell from the extra table (or more exactly, the
first
> unused bucket). A write to the main table is fine, but any write to the extra
> table may move max_bucket forward.
> 
> > header()->max_bucket in this routine, so I don't think any explicit
arguments
> > would be required.  It might be worth a comment in Init() that "MoveCells()
> > depends on the pre-doubling values of header()->max_bucket", but what's
really
> > true is that MoveCells() needs to be understood in the context of Init();
> there
> > isn't a good interface contract between the two routines.
> 
> yeah, I added a comment on MoveCells; let me know if that's not enough.
> 
> > 
> > 
> > > I mean, the final value of
> > > max_bucket is only stable when MoveCells completes so it seemed better to
> > > delegate all actions related to that field to this method. (as always, let
> me
> > > know if you feel strongly about doing that on Init)
> > 
> > I don't care strongly; I think at base what I'm reacting to is that
> MoveCells()
> > doesn't make sense without knowing the particular point in Init() when it's
> > called (i.e. it's not particularly easy to define an interface contract).  
> But
> > I don't see a way to fix that; I do think pulling it out as a subroutine
makes
> > sense just to keep Init() from being a very long function, but it only sorta
> > helps on the abstraction level.
> 
> Exactly. It is more a "now we need to sit down and move stuff, but we are not
> really done with init"
> 
> > 
> > > > Also, it occurs to me reading this code: Where do we fix up the bucket
> > chains
> > > > that are left dangling by folding the extra table into the main table on
a
> > > > doubling?
> > > 
> > > I don't follow the question. We take a chain one at a time and follow it
to
> > the
> > > end, moving some entries from the main table and all entries from the
extra
> > > table (to the new extra table)
> > 
> > I was wondering when the value of bucket->next was updated for the buckets
in
> > the main table.  It looks to me as if the only two places where the next
field
> > is set are in CreateEntryCell() and in GetNextBucket().  Naively, that would
> > seem to imply to me that after moving all the cells from the extra table to
> the
> > main table in MoveCells(), we've got pointers into invalid parts of the
extra
> > table still dangling from the main table buckets.  If GetNextBucket() is
> called
> > on those buckets before any buckets are added to the extra table, that
> function
> > will fix/zero that pointer.  But if we add an extra bucket to the extra
table
> > before calling GetNextBucket() on the main table bucket that used to point
to
> > that extra table, we'll have an incorrect pointer into that bucket.  Am I
> > confused?
> 
> That sounds correct (seems like there's a bug there).

There was actually no bug because the old value will point to the main table
(and not the extra table) so it will be ignored by GetNextBucket. I added a
comment and another unit test.

Randy Smith (Not in Mondays)

On 2013/12/03 00:41:32, rvargas wrote: > On 2013/12/02 19:28:01, rdsmith wrote: > > On 2013/11/26 ...

6 years, 12 months ago (2013-12-26 19:37:24 UTC) #45

On 2013/12/03 00:41:32, rvargas wrote:
> On 2013/12/02 19:28:01, rdsmith wrote:
> > On 2013/11/26 00:48:56, rvargas wrote:
> > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > I feel as if both address and hash could benefit from a comment
somewhere
> > > > indicating that they mean different things in different parts of the
code,
> > > with
> > > > an explanation as to what those differences are (if this is in the
design
> > doc,
> > > > that's fine, and my apologies for forgetting it).
> > > 
> > > How the hash is used to locate entries on the table (and doubling of the
> > table)
> > > is covered on the design doc. But I don't fully follow what you're
thinking.
> > > 
> > > thanks.
> > 
> > So the design document does cover what the hash is and how it's used to
lookup
> > entries (and thank you--it was useful for me to go back to the design doc. 
I
> > think I get more from it each time I go over it after reading the code, not
> > surprisingly).  But my concern is more that the concepts "hash" and
"address"
> > mean different things at different places in the code, and I think it's easy
> for
> > the reader to assume that they mean the same thing, and be confused that
way. 
> > If they were consistently referred to by different names in the code, that
> would
> > address my concern.  I'm not sure if this should be a high priority--if the
> > concepts are clear, it may be that a code reader will be able to track by
> > context which meaning of hash/address is currently relevant.  But that's the
> > concern I was raising. 
> > 
> > With regard to addresses, I think my confusion is between what I'll call
"full
> > addresses" (i.e. what Addr() passes around), and "address values" (what you
> get
> > from IndexCells and EntryCell::GetAddressValue()).  I think it would be
> clearer
> > if those things used different names consistently.  Does this make sense to
> you?
> >  I think that would mean:
> >   * Changing the names of {Get,Set}Cell{SmallTable,}Address &
> >     {FileNumber,StartBlock}FromAddress, along with the variables used
> >     inside those functions?  
> >   * I think based on its usage in IsValidAddress, kMaxAddress would also
> >     become kMaxAddressValue.
> 
> Sure, as I said before, this can be improved by a different name but I have no
> good ideas about what to call it.
> 
> So here's the whole picture:
> disk_cache::Addr -> this is the regular address passed around most of the code
> (I mean, this is visible to most files under disk_cache), and it is the
general
> address of anything (note that the type is not Address)
> 
> disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is
> stored inside an Addr, and it is just a number of a fixed length. I mentioned
> because it has a dedicated type name (and some code uses it, of course), and
> because it "has" the nickname of "address value" as it is accessible as that
> from any Addr.
> 
> And now the new neighbors:
> The address of a Cell -> currently also known as "address", "defined" on the
> file format as that, and with a variable bit length. Originally meant to be
> basically the cell number, or cell index (but index is a very bad name).
Should
> be fairly private to this file (index_table.cc)
> 
> With the added twist that once you have an EntyCell, given that it is meant to
> be used from the outside, it must provide a for of address, and so it
translates
> from cell number to Addr (so Address() returns Addr). But on the other hand,
it
> cannot hide the short version from other code _from this file_, as this file
is
> the one who is aware of the details of how many bits go where, so it provides
> private methods to get to the "real" cell number.
> 
> As such, I would not want to go to AddressValue everywhere (even though today
> that's the closest I have to be able to tell that it means a cell number),
> because it is too close to Addr::value.
> 
> I would be happier if what is stored inside a Cell is a Fraby and so Address()
> returns Addr and GetFraby returns Fraby. The caveat of course is that there is
a
> very simple translation from Fraby to CacheAddr.
> 
> How about Location for anything that relates to the cell number?

That would be fine, but any reason not to use CellNumber?  "Location" sounds
more general, and "CellNumber" sounds a bit more like an implementation detail,
which I think is the more correct impression.  (But "Location" is fine if you
prefer it.)

> 
> > 
> > With regard to hash, similarly there's the portion of the hash that's stored
> in
> > IndexCell (call it "high hash"), the portion that's implied by the position
of
> > the IndexCell within the IndexTable ("low hash"), and the whole thing
(gotten
> > from base::Hash() and split up as needed, "full hash").  It looks to me as
if
> > what I'm calling the low hash is generally referred to as a bucket hash, so
it
> > may be the only real confusion is between high hash and full hash (for
> example,
> > the EntryCell constructor takes a full hash and passes a high hash to
> > SetCellHash, storing the full hash in hash_).  I'm not sure what would be
> > involved in clearing this up or if it's worth it, but I suspect that just
> using
> > different words for the high hash and the full hash (whatever they are)
would
> > reduce confusion in code readers.
> 
> Ah, now I see what you meant.
> 
> On the same lines, I could rename the thing that lives inside a cell as "id"
as
> in it is what identifies whatever is stored in that cell. It may be a little
> weird how the id is derived from part of the hash (versus it _is_ part of the
> hash), but the cell doesn't really care what is there (not that it cares about
> any other field, but I digress)
> 
> uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) {
>   // It is OK for the high order bits of lower_part to overlap with the stored
>   // part of the hash.
>   if (small_table_)
>     return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part;
> 
>   return (GetCellId(cell) << kHashShift) | lower_part;
> }
> 
> That would change bucket_hash to something else (and bucket_id, cell_id)...
> maybe bucket_hash becomes bucket_id.

Sure, that sounds good.  I'd like a comment (or a design doc section, which may
already exist :-}) that describes the relationship between these concepts, but I
do like separating them out.  And I like not naming them anything with "hash" in
it, as that implies a reasonably close, non-complicated relationship to the
actually hash, and the relationship there *is* a bit complicated.

Randy Smith (Not in Mondays)

On 2013/12/04 20:59:34, rdsmith wrote: > On 2013/12/02 21:01:49, rvargas wrote: > > On 2013/12/02 ...

6 years, 12 months ago (2013-12-26 19:39:30 UTC) #46

On 2013/12/04 20:59:34, rdsmith wrote:
> On 2013/12/02 21:01:49, rvargas wrote:
> > On 2013/12/02 18:53:42, rdsmith wrote:
> > > On 2013/11/26 00:48:56, rvargas wrote:
> > > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > > Part-way through reviewing Init(), and getting somewhat bogged down in
> > > > > understanding the use of hash values, so pausing to provide these
> comments
> > > and
> > > > > ask for clarifications on hash.  I believe that:
> > > > > * The hash for a particular key is 32 bits, generated by
base::Hash(key)
> > > > > * The bottom bits (a variable number of bits, depending on
> > table_len/mask_)
> > > > are
> > > > > used as the address of the relevant bucket for that entry.
> > > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the
> IndexCell
> > > and
> > > > > compared on cell lookup.  That shift means that we're comparing just
the
> > > > number
> > > > > of bits stored in the table for each size; i.e.
kHash{SmallTable,}Shift
> is
> > > > > dependent on the number of bits stored for the hash in the different
> types
> > > of
> > > > > table/cells.
> > > > > * The top and bottom bits in this scheme may overlap.
> > > > > * When we increase the table size via doubling (i.e. increase the
number
> > of
> > > > > buckets), that may mean that the extra bit (or bits, if we haven't
> touched
> > a
> > > > > cell from previous doublings) used in indexing into the table may be
> > wrong,
> > > > and
> > > > > the cell may need to be moved.  This is what the MisplacedCell() logic
> is
> > > > about.
> > > > 
> > > > Roughly yes. As the table grows, bits move from the "we need to store
this
> > > part"
> > > > to the "this part is implied given the position in the table". There is
no
> > > need
> > > > to store store less and less bits each time (and complicate all the
logic)
> > so
> > > > that means that each time we double the size, there's another bit that
> > becomes
> > > > redundant.
> > > > 
> > > > When we double the table, we move all cells that have to moved to the
> upper
> > > half
> > > > of the table (about half the cells should move). If that goes well,
there
> > will
> > > > never be a misplaced cell. However, if there is a crash while we are
> moving
> > > > cells (even though this is a memory-only operation), we could end up
with
> > some
> > > > cells on the right place and some cells in the wrong place... those
> > misplaced
> > > > cells are detected by a mismatch on the redundant bits, and can be moved
> as
> > we
> > > > see them.
> > > 
> > > That makes sense.  Where is the logic that moves the cells on table
> doubling? 
> > > Scanning Init(), I only see a call to MoveCells for the old_extra_table
> cells.
> >  
> > 
> > At the end of Init(), there is a call to MoveCells(). MoveSingleCell() goes
> > through a loop and calls MoveSingleCell() when needed... note that the
> function
> > goes through main_table_, even though it needs the caller to provide a
pointer
> > to the old_extra_table.
> 
> Right--I had incorrectly thought that MoveCells() was only about the extra
> table.  Thanks for the pointer.
> 
> > 
> > 
> > > 
> > > Two flags get raised in my brain around crash handling (== unlikely but
not
> > > impossible corner case) code.  The first is that it needs to be tested
> > > thoroughly, because our chances of noticing problems in it from more
general
> > > testing and the field is lower.  Quickly skimming the unit test, I'm not
> sure
> > I
> > > see anything that will result in exercising the
> > > MisplacedHash/HandleMisplacedHash logic; am I missing it?   
> > > 
> > > The second is that code to handle rare corner cases can sometimes result
in
> > > masking other bugs.  It'd be nice if there was some way to disable the
> > misplaced
> > > hash logic for the tests that aren't trying to emulate the crashing code. 
 
> > > 
> > 
> > The tests are not going through inconsistent states or checks at this
point...
> > some of the checks performed will crash the test if they fail (because most
> > tests run without a backend)... and the tests themselves are not generating
> bad
> > data (but I must admit that was a manual process... as in they did for a
> while,
> > and then some step of the test will fail, as not being able to locate
> > something).
> > 
> > I'm not too worried at this point with error recovery paths not covered by
> > tests; I have not even written all those paths to start with, and it is more
> > important to bring the whole system up and start running unit tests of the
> whole
> > thing.
> 
> I'm happy to go with postponing corner case testing until later in the
process,
> but I would sorta like to keep to the invariant that code has tests associated
> with it when it's committed.  Does that make sense to you?  Any easy way to
add
> tests for the misplaced cell logic?

I didn't see a response to this comment in the comment thread or code; did you
have any reaction to it?

> 
> > 
> > 
> > > WDYT?
> > > 
> > > (Still working on review; responding to this now just because it's a
> top-level
> > > comment and thus easiest to respond to directly.)
> > > 
> > > > 
> > > > >  
> > > > > 
> > > > > That all correct?  
> > > > > 
> > > > > I feel as if both address and hash could benefit from a comment
> somewhere
> > > > > indicating that they mean different things in different parts of the
> code,
> > > > with
> > > > > an explanation as to what those differences are (if this is in the
> design
> > > doc,
> > > > > that's fine, and my apologies for forgetting it).
> > > > 
> > > > How the hash is used to locate entries on the table (and doubling of the
> > > table)
> > > > is covered on the design doc. But I don't fully follow what you're
> thinking.
> > > > 
> > > > thanks.

Randy Smith (Not in Mondays)

On 2013/12/26 19:39:30, rdsmith wrote: > On 2013/12/04 20:59:34, rdsmith wrote: > > On 2013/12/02 ...

6 years, 12 months ago (2013-12-26 21:39:41 UTC) #47

On 2013/12/26 19:39:30, rdsmith wrote:
> On 2013/12/04 20:59:34, rdsmith wrote:
> > On 2013/12/02 21:01:49, rvargas wrote:
> > > On 2013/12/02 18:53:42, rdsmith wrote:
> > > > On 2013/11/26 00:48:56, rvargas wrote:
> > > > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > > > Part-way through reviewing Init(), and getting somewhat bogged down
in
> > > > > > understanding the use of hash values, so pausing to provide these
> > comments
> > > > and
> > > > > > ask for clarifications on hash.  I believe that:
> > > > > > * The hash for a particular key is 32 bits, generated by
> base::Hash(key)
> > > > > > * The bottom bits (a variable number of bits, depending on
> > > table_len/mask_)
> > > > > are
> > > > > > used as the address of the relevant bucket for that entry.
> > > > > > * The top bits (32 - kHash{SmallTable,}Shift) are stored in the
> > IndexCell
> > > > and
> > > > > > compared on cell lookup.  That shift means that we're comparing just
> the
> > > > > number
> > > > > > of bits stored in the table for each size; i.e.
> kHash{SmallTable,}Shift
> > is
> > > > > > dependent on the number of bits stored for the hash in the different
> > types
> > > > of
> > > > > > table/cells.
> > > > > > * The top and bottom bits in this scheme may overlap.
> > > > > > * When we increase the table size via doubling (i.e. increase the
> number
> > > of
> > > > > > buckets), that may mean that the extra bit (or bits, if we haven't
> > touched
> > > a
> > > > > > cell from previous doublings) used in indexing into the table may be
> > > wrong,
> > > > > and
> > > > > > the cell may need to be moved.  This is what the MisplacedCell()
logic
> > is
> > > > > about.
> > > > > 
> > > > > Roughly yes. As the table grows, bits move from the "we need to store
> this
> > > > part"
> > > > > to the "this part is implied given the position in the table". There
is
> no
> > > > need
> > > > > to store store less and less bits each time (and complicate all the
> logic)
> > > so
> > > > > that means that each time we double the size, there's another bit that
> > > becomes
> > > > > redundant.
> > > > > 
> > > > > When we double the table, we move all cells that have to moved to the
> > upper
> > > > half
> > > > > of the table (about half the cells should move). If that goes well,
> there
> > > will
> > > > > never be a misplaced cell. However, if there is a crash while we are
> > moving
> > > > > cells (even though this is a memory-only operation), we could end up
> with
> > > some
> > > > > cells on the right place and some cells in the wrong place... those
> > > misplaced
> > > > > cells are detected by a mismatch on the redundant bits, and can be
moved
> > as
> > > we
> > > > > see them.
> > > > 
> > > > That makes sense.  Where is the logic that moves the cells on table
> > doubling? 
> > > > Scanning Init(), I only see a call to MoveCells for the old_extra_table
> > cells.
> > >  
> > > 
> > > At the end of Init(), there is a call to MoveCells(). MoveSingleCell()
goes
> > > through a loop and calls MoveSingleCell() when needed... note that the
> > function
> > > goes through main_table_, even though it needs the caller to provide a
> pointer
> > > to the old_extra_table.
> > 
> > Right--I had incorrectly thought that MoveCells() was only about the extra
> > table.  Thanks for the pointer.
> > 
> > > 
> > > 
> > > > 
> > > > Two flags get raised in my brain around crash handling (== unlikely but
> not
> > > > impossible corner case) code.  The first is that it needs to be tested
> > > > thoroughly, because our chances of noticing problems in it from more
> general
> > > > testing and the field is lower.  Quickly skimming the unit test, I'm not
> > sure
> > > I
> > > > see anything that will result in exercising the
> > > > MisplacedHash/HandleMisplacedHash logic; am I missing it?   
> > > > 
> > > > The second is that code to handle rare corner cases can sometimes result
> in
> > > > masking other bugs.  It'd be nice if there was some way to disable the
> > > misplaced
> > > > hash logic for the tests that aren't trying to emulate the crashing
code. 
>  
> > > > 
> > > 
> > > The tests are not going through inconsistent states or checks at this
> point...
> > > some of the checks performed will crash the test if they fail (because
most
> > > tests run without a backend)... and the tests themselves are not
generating
> > bad
> > > data (but I must admit that was a manual process... as in they did for a
> > while,
> > > and then some step of the test will fail, as not being able to locate
> > > something).
> > > 
> > > I'm not too worried at this point with error recovery paths not covered by
> > > tests; I have not even written all those paths to start with, and it is
more
> > > important to bring the whole system up and start running unit tests of the
> > whole
> > > thing.
> > 
> > I'm happy to go with postponing corner case testing until later in the
> process,
> > but I would sorta like to keep to the invariant that code has tests
associated
> > with it when it's committed.  Does that make sense to you?  Any easy way to
> add
> > tests for the misplaced cell logic?
> 
> I didn't see a response to this comment in the comment thread or code; did you
> have any reaction to it?

Whoops, now I see the NOTREACHED(); in HandleMisplacedCells().  I'm good with
that.

> 
> > 
> > > 
> > > 
> > > > WDYT?
> > > > 
> > > > (Still working on review; responding to this now just because it's a
> > top-level
> > > > comment and thus easiest to respond to directly.)
> > > > 
> > > > > 
> > > > > >  
> > > > > > 
> > > > > > That all correct?  
> > > > > > 
> > > > > > I feel as if both address and hash could benefit from a comment
> > somewhere
> > > > > > indicating that they mean different things in different parts of the
> > code,
> > > > > with
> > > > > > an explanation as to what those differences are (if this is in the
> > design
> > > > doc,
> > > > > > that's fine, and my apologies for forgetting it).
> > > > > 
> > > > > How the hash is used to locate entries on the table (and doubling of
the
> > > > table)
> > > > > is covered on the design doc. But I don't fully follow what you're
> > thinking.
> > > > > 
> > > > > thanks.

Randy Smith (Not in Mondays)

Next round of comments. One high level question, that doesn't have much if any relationship ...

6 years, 12 months ago (2013-12-26 21:45:48 UTC) #48

Next round of comments.  

One high level question, that doesn't have much if any relationship to this CL:
In the design doc, you mention that at some point when you're getting close to
overflow on the timestamp you'll need to rebase all timestamps on the table.  I
don't seen any logic for this operation in your CL 17507006 (the whole v3 disk
cache change); is this TBD?  It'll obviously (?) need to be done before this
code is released into the wild.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
On 2013/12/05 19:17:48, rdsmith wrote:
> On 2013/12/04 01:04:17, rvargas wrote:
> > On 2013/12/02 21:48:06, rdsmith wrote:
> > > On 2013/11/26 00:32:41, rvargas wrote:
> > > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > > What results in this 11?  This seems like you're simply saying that
you
> > > can't
> > > > > have a table_len greater than 32 bits, in which case don't you want to
> > just
> > > > say
> > > > > that?  (I.e. IIRTCC the 11 is dependent on a specific value of
> > > kBaseTableLen,
> > > > > and that dependency should be explicit in the DCHECK).
> > > > 
> > > > This is just saying that the table can be doubled 11 times from the
> smaller
> > > > table  to the biggest table. Yes, the value (11) is derived from
> > kBaseTableLen
> > > > and maybe kMaxAddress or kCellAddressMask, but I thought it was clearer
to
> > > have
> > > > the number of bits here (right when we are saying "how many extra bits
can
> > we
> > > > have") rather to come out with a probably cryptic formula that derives
the
> > > > proper value from other numbers. The cost is of course less flexibility
if
> > the
> > > > format change, but at least it will fail right when it is needed. When I
> was
> > > > writing the code this value changed right here multiple times depending
on
> > > what
> > > > I had implemented (and was ready to test) so it seemed better to keep it
> > that
> > > > way instead of tying it to other constants.
> > > 
> > > Huh.  I don't particularly care about future flexibility (I mean, future
> > > flexibility is good, just not my highest current priority :-}).  But from
my
> > > perspective in reading the code, having the cryptic formulas calls out to
> the
> > > reader what the dependencies between the different constants are.  I had
> been
> > > thinking of asking if we could have constants defined in disk_format_v3.h
> for
> > > the different pieces of the IndexCell, and then define the other constants
> > that
> > > need to match those based on those constants.  It means if you're tracing
> the
> > > "Why is this number this value?" question, you get roadsigns.  
> > > 
> > > Can you say more about why having an 11 is clearer?  I'm not getting it. 
It
> > > doesn't feel like it tells the code reader/maintainer/debugger anything.
> > >  
> > > > I can define a local constant here if you don't like it.
> > > 
> > > I'm not interested in "const uint32 kExtraBitsLimit = 11;", if that's what
> > > you're offering; I don't find that any more useful than just using 11. 
But
> > I'd
> > > be interested in your thoughts about my general argument above.
> > > 
> > > 
> > 
> > My best example would be the failure of the formula two lines below this
one.
> > 
> > I can come up with something like the log of the largest table divided by
the
> > smallest table, but then that rises the issue of deriving the largest table.
I
> > hear what you are saying and I generally agree, to the point of having a
bunch
> > of constants with meaningful names whose values can be derived easily by
just
> > inspection (the reason for the constants here).
> > 
> > In this particular case, we have a line that calculates a number from the
> > comparison of the current table versus the smallest table. The first check
> says
> > we cannot have a negative number (a table smaller than the smallest), the
> second
> > line checks the value against an upper bound, and that should tell right
away
> > that we are checking for the largest possible table (right?)
> > 
> > The only question is then why the largest table is 11 times larger than the
> > smallest table, or in other words, what is the size of the largest table.
> > Honestly, if I'm reading some code that I didn't write, I would not care
much
> > about the value of that number (the actual limit for the test): I would
mostly
> > care for some limit being there, either on a check, or in the code that uses
> > that number. In this case, it would mean to me the same if that value is 8
or
> > 15, or derived from a formula. 8 could be wrong, the formula could be off by
> one
> > and in any case I would not spend time looking at the value unless I'm
> actually
> > trying to debug a problem and I have reason to suspect that the max value
> could
> > be at fault. And if I'm doing that, I would probably ignore the formula,
> figure
> > out what the value should be and compare that against what the code says
> (number
> > of bits on an address = 22, so 21 of them go to the main table; smaller
table
> =
> > 1024 (10 bits), so delta = 21 - 10)
> > 
> > And that's why I prefer a plain number here rather than a formula... given
> that
> > too much may be read from the formula (as in the check against 6). What I
> think
> > it matters is not the value, but the check itself (double check an upper
> limit,
> > whatever that is)
> > 
> > I hope I don't sound combative, that's not my intention. I just want to
answer
> > the question before we setup in what to do.
> 
> Nope, you don't sound combative, and I appreciate the detailed explanation.
> 
> > 
> > I'm up for moving relevant constants to the file format header. That's why
> > kBaseTableLen is there. Shifts and masks I'm not so sure as they make more
> sense
> > with the code that uses them, and I don't want to make them visible to any
> code
> > outside this file.
> 
> I think most of my feeling on this topic would be addressed by making explicit
> the relationships between the different constants--I don't have strong general
> feelings about either where the constants are declared or this particular
> location being a number rather than a constant 11.  But I would like constants
> that are used in multiple places in the code and do have a relationship to
each
> other to be expressed in terms of each other.  So for instance,
disk_format_v3.h
> would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and
> kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from
> those values (probably in index_table.cc).  Would you be good with that?

Any thoughts on this request?

An extreme example that I don't think I've previously mentioned: kBaseTableLen
is defined in three different places (backend_impl.cc, backend_impl_v3.cc, and
backend_worker.cc, disk_format_v3.h), and the definition in the .h file looks to
contradict the common definition in the .cc files.  

> 
> Having said all that, the place I diverge from you in modeling this code being
> read by someone who doesn't know it is, if they suspected the max value would
be
> at fault, the formula would give them a hint as to how that max was derived. 
> That hint might be wrong, but it's structure that would lead them further into
> the code, and point at the logic that they would need to vet.  But as I say, I
> more care about the relationship between the defined constants being clear
than
> about this particular number being a formula.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
On 2013/12/07 02:13:45, rvargas wrote:
> On 2013/12/05 19:17:48, rdsmith wrote:
> > On 2013/12/04 01:04:17, rvargas wrote:
> > > On 2013/12/02 21:48:06, rdsmith wrote:
> > > > What's the context for these parameters?  They're only used on the first
> > call
> > > to
> > > > Init(); are they defined on other calls?  And where do they come from on
> the
> > > > first call to Init()?  What do they contain?  It feels a bit weird to
have
> > > this
> > > > function responsible for allocating storage on growing, but not on first
> > > > initialization.
> > > 
> > > There are dedicated files to store the backups. When the cache starts,
those
> > > files are read and the contents are provided to this method, so that we
can
> > > check the backup against the mmaped bitmap. When growing, there is no need
> to
> > > read any file because the up to date version is already in memory, owned
by
> > this
> > > object.
> > 
> > So that means at some point in the future there will be code checking the
> backup
> > bitmaps against the current bitmaps on first load?
> 
> Checking against the backup is already in place. It is performed on demand by
> CheckState()
> 
> 
> > Also, what's the allocation behavior for params->backup_header and
> backup_bitmap
> > on calls to Init() after the first?  There's not a code correctness issue,
but
> > it's sorta a shame if they keep getting allocated and then deleted without
> being
> > examined.
> 
> The caller doesn't pass them. That would require reading the files, and that
> doesn't happen for reinit.

Clarifying comment, either on IndexTableInitData or Init()?  

(Side note: I've been on-and-off tempted reading this code to suggest that it be
pulled out into three separate functions for the three separate things that
could be happening: initialization, re-init without doubling (increase in the
extra table), and re-init with doubling, because as I've read the function I've
had to spend some mental energy tracking those three pathways and make sure each
is clean.  But I haven't been able to convince myself that the code would
actually be cleaner and more maintainable that way; it would involve a lot more
functions that each branch would call, and the obvious way to implement it would
be for Init() to branch when first entered, so I haven't brought it up.  I'm
just mentioning it now in case the various requests I'm making for comments to
help keep those three things straight are onerous enough that the alternative
has appeal :-}.)

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:687: break;
On 2013/12/07 02:13:45, rvargas wrote:
> On 2013/12/05 19:17:48, rdsmith wrote:
> > This pattern doesn't have much virtue over if conditionals if it has a
default
> > branch; the advantage (at least to me) of using a switch statement is that
if
> > you don't have a default and the enum changes, the compiler warns you that
you
> > have to update the switch statement.  
> > 
> > (Still completely a suggestion; just sharing with you a coding pattern I've
> > found useful in the past.)
> 
> Yes, I understand your point. But the switch doesn't cover all cases so
between
> extending it to cover all cases with nothing and a default I prefer having a
> default (in general), even if it "breaks" if the enum changes. On the other
> hand, if changing one line makes you happy I'm happy.

No worries; just a suggestion.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_...
File net/disk_cache/v3/disk_format_v3.h (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_...
net/disk_cache/v3/disk_format_v3.h:61: CACHE_EVICTED = 1 << 2      // Already
evicted at least one entry.
nit, suggestion: It's a bit worrisome to see a non-backward compatible change in
a file describing disk format.  I poked around in the code, and as I'm sure you
know, this is fine because CACHE_EVICT* aren't used anywhere.  But is there a
reason not to have the discipline of adding new flags at the end of the enum so
that that type of thing doesn't need to be verified?

(I suppose that strictly speaking you should always bump the version when you
modify anything in the format, and I can completely understand not wanting to do
that during development, and as long as you're in development you're the only
person who'll be inconvenienced by mistakes in this space.  Thus "nit,
suggestion" :-}.  But it does look a bit weird from the outside.)

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:990: // the link.
This comment is useful for someone reading through to try and understand
initialization, but the place where I think the comment would be most useful is
in GetNextBucket().  If this dangling pointer ever does cause problems, it'll be
there.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:146: virtual void SaveIndex(net::IOBuffer*
buffer, int buffer_len) = 0;
nit, suggestion: My scan of the code makes me think that we don't actually need
the complexity of the IOBuffer here; it looks like OnBackupTimer() just passes
ownership to the backend (the reference is dropped as soon as SaveIndex() is
called) and that nothing in the backend uses the refcounting/cancellation
semantics of IOBuffer.  If that's the case, it would seem simpler to pass a
scoped_array<char>?

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:180: // called just to reinitialize the object.
Willing to say a bit about what it should be used for?  I wasn't completely
clear looking at the usage in the total V3 CL what the purpose of this is; is it
just a shutdown method?  In which case calling it "Shutdown()" might make more
sense.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table_unittest.cc (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:574: // Tests doubling of the table.
A couple of cases I'm not certain this test covers that I think would be good to
make sure are covered:
* Doublings that will require moving cells in the table because of the boundary
change as to what portion of the hash is used for bucket_id.
* Cells in the extra table on doubling, which will need to be successfully
copied into the main table.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:615: // Tests bucket chaining when
growing the index.
Why does using the upper half of the table mean we're exercising bucket
chaining?  Don't we need more than four entries with the same bucket id to test
chaining?

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:706: }
Possibly also a test for the logic in CheckState()?

rvargas (doing something else)

On 2013/12/26 19:37:24, rdsmith wrote: > On 2013/12/03 00:41:32, rvargas wrote: > > On 2013/12/02 ...

6 years, 12 months ago (2013-12-26 23:44:15 UTC) #49

On 2013/12/26 19:37:24, rdsmith wrote:
> On 2013/12/03 00:41:32, rvargas wrote:
> > On 2013/12/02 19:28:01, rdsmith wrote:
> > > On 2013/11/26 00:48:56, rvargas wrote:
> > > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > > I feel as if both address and hash could benefit from a comment
> somewhere
> > > > > indicating that they mean different things in different parts of the
> code,
> > > > with
> > > > > an explanation as to what those differences are (if this is in the
> design
> > > doc,
> > > > > that's fine, and my apologies for forgetting it).
> > > > 
> > > > How the hash is used to locate entries on the table (and doubling of the
> > > table)
> > > > is covered on the design doc. But I don't fully follow what you're
> thinking.
> > > > 
> > > > thanks.
> > > 
> > > So the design document does cover what the hash is and how it's used to
> lookup
> > > entries (and thank you--it was useful for me to go back to the design doc.

> I
> > > think I get more from it each time I go over it after reading the code,
not
> > > surprisingly).  But my concern is more that the concepts "hash" and
> "address"
> > > mean different things at different places in the code, and I think it's
easy
> > for
> > > the reader to assume that they mean the same thing, and be confused that
> way. 
> > > If they were consistently referred to by different names in the code, that
> > would
> > > address my concern.  I'm not sure if this should be a high priority--if
the
> > > concepts are clear, it may be that a code reader will be able to track by
> > > context which meaning of hash/address is currently relevant.  But that's
the
> > > concern I was raising. 
> > > 
> > > With regard to addresses, I think my confusion is between what I'll call
> "full
> > > addresses" (i.e. what Addr() passes around), and "address values" (what
you
> > get
> > > from IndexCells and EntryCell::GetAddressValue()).  I think it would be
> > clearer
> > > if those things used different names consistently.  Does this make sense
to
> > you?
> > >  I think that would mean:
> > >   * Changing the names of {Get,Set}Cell{SmallTable,}Address &
> > >     {FileNumber,StartBlock}FromAddress, along with the variables used
> > >     inside those functions?  
> > >   * I think based on its usage in IsValidAddress, kMaxAddress would also
> > >     become kMaxAddressValue.
> > 
> > Sure, as I said before, this can be improved by a different name but I have
no
> > good ideas about what to call it.
> > 
> > So here's the whole picture:
> > disk_cache::Addr -> this is the regular address passed around most of the
code
> > (I mean, this is visible to most files under disk_cache), and it is the
> general
> > address of anything (note that the type is not Address)
> > 
> > disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what is
> > stored inside an Addr, and it is just a number of a fixed length. I
mentioned
> > because it has a dedicated type name (and some code uses it, of course), and
> > because it "has" the nickname of "address value" as it is accessible as that
> > from any Addr.
> > 
> > And now the new neighbors:
> > The address of a Cell -> currently also known as "address", "defined" on the
> > file format as that, and with a variable bit length. Originally meant to be
> > basically the cell number, or cell index (but index is a very bad name).
> Should
> > be fairly private to this file (index_table.cc)
> > 
> > With the added twist that once you have an EntyCell, given that it is meant
to
> > be used from the outside, it must provide a for of address, and so it
> translates
> > from cell number to Addr (so Address() returns Addr). But on the other hand,
> it
> > cannot hide the short version from other code _from this file_, as this file
> is
> > the one who is aware of the details of how many bits go where, so it
provides
> > private methods to get to the "real" cell number.
> > 
> > As such, I would not want to go to AddressValue everywhere (even though
today
> > that's the closest I have to be able to tell that it means a cell number),
> > because it is too close to Addr::value.
> > 
> > I would be happier if what is stored inside a Cell is a Fraby and so
Address()
> > returns Addr and GetFraby returns Fraby. The caveat of course is that there
is
> a
> > very simple translation from Fraby to CacheAddr.
> > 
> > How about Location for anything that relates to the cell number?
> 
> That would be fine, but any reason not to use CellNumber?  "Location" sounds
> more general, and "CellNumber" sounds a bit more like an implementation
detail,
> which I think is the more correct impression.  (But "Location" is fine if you
> prefer it.)


I meant "Fraby" -> "Location". In other words, the number that is stored inside
a cell would become "Location" instead of "Address".

CellNumber implies the number of a given cell inside the table (10th cell, part
of the 3rd bucket etc).



> 
> > 
> > > 
> > > With regard to hash, similarly there's the portion of the hash that's
stored
> > in
> > > IndexCell (call it "high hash"), the portion that's implied by the
position
> of
> > > the IndexCell within the IndexTable ("low hash"), and the whole thing
> (gotten
> > > from base::Hash() and split up as needed, "full hash").  It looks to me as
> if
> > > what I'm calling the low hash is generally referred to as a bucket hash,
so
> it
> > > may be the only real confusion is between high hash and full hash (for
> > example,
> > > the EntryCell constructor takes a full hash and passes a high hash to
> > > SetCellHash, storing the full hash in hash_).  I'm not sure what would be
> > > involved in clearing this up or if it's worth it, but I suspect that just
> > using
> > > different words for the high hash and the full hash (whatever they are)
> would
> > > reduce confusion in code readers.
> > 
> > Ah, now I see what you meant.
> > 
> > On the same lines, I could rename the thing that lives inside a cell as "id"
> as
> > in it is what identifies whatever is stored in that cell. It may be a little
> > weird how the id is derived from part of the hash (versus it _is_ part of
the
> > hash), but the cell doesn't really care what is there (not that it cares
about
> > any other field, but I digress)
> > 
> > uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) {
> >   // It is OK for the high order bits of lower_part to overlap with the
stored
> >   // part of the hash.
> >   if (small_table_)
> >     return (GetCellSmallTableId(cell) << kHashSmallTableShift) | lower_part;
> > 
> >   return (GetCellId(cell) << kHashShift) | lower_part;
> > }
> > 
> > That would change bucket_hash to something else (and bucket_id, cell_id)...
> > maybe bucket_hash becomes bucket_id.
> 
> Sure, that sounds good.  I'd like a comment (or a design doc section, which
may
> already exist :-}) that describes the relationship between these concepts, but
I
> do like separating them out.  And I like not naming them anything with "hash"
in
> it, as that implies a reasonably close, non-complicated relationship to the
> actually hash, and the relationship there *is* a bit complicated.

rvargas (doing something else)

On 2013/12/26 21:45:48, rdsmith wrote: > Next round of comments. > > One high level ...

6 years, 12 months ago (2013-12-26 23:52:33 UTC) #50

Randy Smith (Not in Mondays)

On 2013/12/26 23:44:15, rvargas wrote: > On 2013/12/26 19:37:24, rdsmith wrote: > > On 2013/12/03 ...

6 years, 12 months ago (2013-12-27 15:16:53 UTC) #51

On 2013/12/26 23:44:15, rvargas wrote:
> On 2013/12/26 19:37:24, rdsmith wrote:
> > On 2013/12/03 00:41:32, rvargas wrote:
> > > On 2013/12/02 19:28:01, rdsmith wrote:
> > > > On 2013/11/26 00:48:56, rvargas wrote:
> > > > > On 2013/11/25 19:48:07, rdsmith wrote:
> > > > > > I feel as if both address and hash could benefit from a comment
> > somewhere
> > > > > > indicating that they mean different things in different parts of the
> > code,
> > > > > with
> > > > > > an explanation as to what those differences are (if this is in the
> > design
> > > > doc,
> > > > > > that's fine, and my apologies for forgetting it).
> > > > > 
> > > > > How the hash is used to locate entries on the table (and doubling of
the
> > > > table)
> > > > > is covered on the design doc. But I don't fully follow what you're
> > thinking.
> > > > > 
> > > > > thanks.
> > > > 
> > > > So the design document does cover what the hash is and how it's used to
> > lookup
> > > > entries (and thank you--it was useful for me to go back to the design
doc.
> 
> > I
> > > > think I get more from it each time I go over it after reading the code,
> not
> > > > surprisingly).  But my concern is more that the concepts "hash" and
> > "address"
> > > > mean different things at different places in the code, and I think it's
> easy
> > > for
> > > > the reader to assume that they mean the same thing, and be confused that
> > way. 
> > > > If they were consistently referred to by different names in the code,
that
> > > would
> > > > address my concern.  I'm not sure if this should be a high priority--if
> the
> > > > concepts are clear, it may be that a code reader will be able to track
by
> > > > context which meaning of hash/address is currently relevant.  But that's
> the
> > > > concern I was raising. 
> > > > 
> > > > With regard to addresses, I think my confusion is between what I'll call
> > "full
> > > > addresses" (i.e. what Addr() passes around), and "address values" (what
> you
> > > get
> > > > from IndexCells and EntryCell::GetAddressValue()).  I think it would be
> > > clearer
> > > > if those things used different names consistently.  Does this make sense
> to
> > > you?
> > > >  I think that would mean:
> > > >   * Changing the names of {Get,Set}Cell{SmallTable,}Address &
> > > >     {FileNumber,StartBlock}FromAddress, along with the variables used
> > > >     inside those functions?  
> > > >   * I think based on its usage in IsValidAddress, kMaxAddress would also
> > > >     become kMaxAddressValue.
> > > 
> > > Sure, as I said before, this can be improved by a different name but I
have
> no
> > > good ideas about what to call it.
> > > 
> > > So here's the whole picture:
> > > disk_cache::Addr -> this is the regular address passed around most of the
> code
> > > (I mean, this is visible to most files under disk_cache), and it is the
> > general
> > > address of anything (note that the type is not Address)
> > > 
> > > disk_cache::Addr::value == disk_cache::CacheAddr == uint32. This is what
is
> > > stored inside an Addr, and it is just a number of a fixed length. I
> mentioned
> > > because it has a dedicated type name (and some code uses it, of course),
and
> > > because it "has" the nickname of "address value" as it is accessible as
that
> > > from any Addr.
> > > 
> > > And now the new neighbors:
> > > The address of a Cell -> currently also known as "address", "defined" on
the
> > > file format as that, and with a variable bit length. Originally meant to
be
> > > basically the cell number, or cell index (but index is a very bad name).
> > Should
> > > be fairly private to this file (index_table.cc)
> > > 
> > > With the added twist that once you have an EntyCell, given that it is
meant
> to
> > > be used from the outside, it must provide a for of address, and so it
> > translates
> > > from cell number to Addr (so Address() returns Addr). But on the other
hand,
> > it
> > > cannot hide the short version from other code _from this file_, as this
file
> > is
> > > the one who is aware of the details of how many bits go where, so it
> provides
> > > private methods to get to the "real" cell number.
> > > 
> > > As such, I would not want to go to AddressValue everywhere (even though
> today
> > > that's the closest I have to be able to tell that it means a cell number),
> > > because it is too close to Addr::value.
> > > 
> > > I would be happier if what is stored inside a Cell is a Fraby and so
> Address()
> > > returns Addr and GetFraby returns Fraby. The caveat of course is that
there
> is
> > a
> > > very simple translation from Fraby to CacheAddr.
> > > 
> > > How about Location for anything that relates to the cell number?
> > 
> > That would be fine, but any reason not to use CellNumber?  "Location" sounds
> > more general, and "CellNumber" sounds a bit more like an implementation
> detail,
> > which I think is the more correct impression.  (But "Location" is fine if
you
> > prefer it.)
> 
> 
> I meant "Fraby" -> "Location". In other words, the number that is stored
inside
> a cell would become "Location" instead of "Address".
> 
> CellNumber implies the number of a given cell inside the table (10th cell,
part
> of the 3rd bucket etc).

"Location" is fine.

> 
> 
> 
> > 
> > > 
> > > > 
> > > > With regard to hash, similarly there's the portion of the hash that's
> stored
> > > in
> > > > IndexCell (call it "high hash"), the portion that's implied by the
> position
> > of
> > > > the IndexCell within the IndexTable ("low hash"), and the whole thing
> > (gotten
> > > > from base::Hash() and split up as needed, "full hash").  It looks to me
as
> > if
> > > > what I'm calling the low hash is generally referred to as a bucket hash,
> so
> > it
> > > > may be the only real confusion is between high hash and full hash (for
> > > example,
> > > > the EntryCell constructor takes a full hash and passes a high hash to
> > > > SetCellHash, storing the full hash in hash_).  I'm not sure what would
be
> > > > involved in clearing this up or if it's worth it, but I suspect that
just
> > > using
> > > > different words for the high hash and the full hash (whatever they are)
> > would
> > > > reduce confusion in code readers.
> > > 
> > > Ah, now I see what you meant.
> > > 
> > > On the same lines, I could rename the thing that lives inside a cell as
"id"
> > as
> > > in it is what identifies whatever is stored in that cell. It may be a
little
> > > weird how the id is derived from part of the hash (versus it _is_ part of
> the
> > > hash), but the cell doesn't really care what is there (not that it cares
> about
> > > any other field, but I digress)
> > > 
> > > uint32 IndexTable::GetFullHash(const IndexCell& cell, uint32 lower_part) {
> > >   // It is OK for the high order bits of lower_part to overlap with the
> stored
> > >   // part of the hash.
> > >   if (small_table_)
> > >     return (GetCellSmallTableId(cell) << kHashSmallTableShift) |
lower_part;
> > > 
> > >   return (GetCellId(cell) << kHashShift) | lower_part;
> > > }
> > > 
> > > That would change bucket_hash to something else (and bucket_id,
cell_id)...
> > > maybe bucket_hash becomes bucket_id.
> > 
> > Sure, that sounds good.  I'd like a comment (or a design doc section, which
> may
> > already exist :-}) that describes the relationship between these concepts,
but
> I
> > do like separating them out.  And I like not naming them anything with
"hash"
> in
> > it, as that implies a reasonably close, non-complicated relationship to the
> > actually hash, and the relationship there *is* a bit complicated.

Randy Smith (Not in Mondays)

On 2013/12/26 23:52:33, rvargas wrote: > On 2013/12/26 21:45:48, rdsmith wrote: > > Next round ...

6 years, 12 months ago (2013-12-27 15:19:09 UTC) #52

rvargas (doing something else)

renamed version up. https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index_table.cc#newcode473 net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11); > > I think ...

6 years, 12 months ago (2013-12-27 19:31:45 UTC) #53

renamed version up.

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1810001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:473: DCHECK_LE(extra_bits_, 11);
> > I think most of my feeling on this topic would be addressed by making
explicit
> > the relationships between the different constants--I don't have strong
general
> > feelings about either where the constants are declared or this particular
> > location being a number rather than a constant 11.  But I would like
constants
> > that are used in multiple places in the code and do have a relationship to
> each
> > other to be expressed in terms of each other.  So for instance,
> disk_format_v3.h
> > would have k{SmallTable,}HashOffset and k{SmallTable,}HashLength, and
> > kHash{SmallTable,}Shift and kCell{SmallTable,}HashMask would be defined from
> > those values (probably in index_table.cc).  Would you be good with that?
> 
> Any thoughts on this request?

I tried to move the conversation to the latest version of the code, where I
added comments about it.

> An extreme example that I don't think I've previously mentioned: kBaseTableLen
> is defined in three different places (backend_impl.cc, backend_impl_v3.cc, and
> backend_worker.cc, disk_format_v3.h), and the definition in the .h file looks
to
> contradict the common definition in the .cc files.  

That is just a transient detail...
- backend_impl.cc: nothing to do about it (old format)
- backend_impl_v3.cc: old, unused code
- backend_worker.cc: either old, unused code, or using the value from
disk_format_v3.h
- disk_format_v3.h: definition for the new format.

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1920001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:519:
backup_header_.reset(params->backup_header.release());
On 2013/12/26 21:45:49, rdsmith wrote:
> On 2013/12/07 02:13:45, rvargas wrote:
> > On 2013/12/05 19:17:48, rdsmith wrote:
> > > On 2013/12/04 01:04:17, rvargas wrote:
> > > > On 2013/12/02 21:48:06, rdsmith wrote:
> > > > > What's the context for these parameters?  They're only used on the
first
> > > call
> > > > to
> > > > > Init(); are they defined on other calls?  And where do they come from
on
> > the
> > > > > first call to Init()?  What do they contain?  It feels a bit weird to
> have
> > > > this
> > > > > function responsible for allocating storage on growing, but not on
first
> > > > > initialization.
> > > > 
> > > > There are dedicated files to store the backups. When the cache starts,
> those
> > > > files are read and the contents are provided to this method, so that we
> can
> > > > check the backup against the mmaped bitmap. When growing, there is no
need
> > to
> > > > read any file because the up to date version is already in memory, owned
> by
> > > this
> > > > object.
> > > 
> > > So that means at some point in the future there will be code checking the
> > backup
> > > bitmaps against the current bitmaps on first load?
> > 
> > Checking against the backup is already in place. It is performed on demand
by
> > CheckState()
> > 
> > 
> > > Also, what's the allocation behavior for params->backup_header and
> > backup_bitmap
> > > on calls to Init() after the first?  There's not a code correctness issue,
> but
> > > it's sorta a shame if they keep getting allocated and then deleted without
> > being
> > > examined.
> > 
> > The caller doesn't pass them. That would require reading the files, and that
> > doesn't happen for reinit.
> 
> Clarifying comment, either on IndexTableInitData or Init()?  
> 
> (Side note: I've been on-and-off tempted reading this code to suggest that it
be
> pulled out into three separate functions for the three separate things that
> could be happening: initialization, re-init without doubling (increase in the
> extra table), and re-init with doubling, because as I've read the function
I've
> had to spend some mental energy tracking those three pathways and make sure
each
> is clean.  But I haven't been able to convince myself that the code would
> actually be cleaner and more maintainable that way; it would involve a lot
more
> functions that each branch would call, and the obvious way to implement it
would
> be for Init() to branch when first entered, so I haven't brought it up.  I'm
> just mentioning it now in case the various requests I'm making for comments to
> help keep those three things straight are onerous enough that the alternative
> has appeal :-}.)

Done.

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1930001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:471: // extra_bits_ is really measured against
table-size specific values.
This was the answer to making some constants depend on some others (quoted on
the previous comment).

Note that revision 11 adds the comments I mention here.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_...
File net/disk_cache/v3/disk_format_v3.h (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/disk_...
net/disk_cache/v3/disk_format_v3.h:61: CACHE_EVICTED = 1 << 2      // Already
evicted at least one entry.
On 2013/12/26 21:45:49, rdsmith wrote:
> nit, suggestion: It's a bit worrisome to see a non-backward compatible change
in
> a file describing disk format.  I poked around in the code, and as I'm sure
you
> know, this is fine because CACHE_EVICT* aren't used anywhere.  But is there a
> reason not to have the discipline of adding new flags at the end of the enum
so
> that that type of thing doesn't need to be verified?
> 
> (I suppose that strictly speaking you should always bump the version when you
> modify anything in the format, and I can completely understand not wanting to
do
> that during development, and as long as you're in development you're the only
> person who'll be inconvenienced by mistakes in this space.  Thus "nit,
> suggestion" :-}.  But it does look a bit weird from the outside.)

I intend to make 3.0 the version of the file that is released to users (Beta
ideally). Anything before that is not a stable file format, doesn't have users
or even code that can be compiled out of TOT, so we should be free to change
whatever we'd like to without worrying about compatibility.

A flag stating how to read the table should go before one that specifies an
eviction algorithm to be used.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:990: // the link.
On 2013/12/26 21:45:49, rdsmith wrote:
> This comment is useful for someone reading through to try and understand
> initialization, but the place where I think the comment would be most useful
is
> in GetNextBucket().  If this dangling pointer ever does cause problems, it'll
be
> there.

Done.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.h (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:146: virtual void SaveIndex(net::IOBuffer*
buffer, int buffer_len) = 0;
On 2013/12/26 21:45:49, rdsmith wrote:
> nit, suggestion: My scan of the code makes me think that we don't actually
need
> the complexity of the IOBuffer here; it looks like OnBackupTimer() just passes
> ownership to the backend (the reference is dropped as soon as SaveIndex() is
> called) and that nothing in the backend uses the refcounting/cancellation
> semantics of IOBuffer.  If that's the case, it would seem simpler to pass a
> scoped_array<char>?

At the end, the buffer is going to be written to disk with one of the normal
write functions (and they all take IOBuffers). The overall premise is always
that the initial source of the data should allocate an IOBuffer instead of
performing copies or wrappers in the middle.

Don't forget that having multiple threads inlolved in IO (and tasks bouncing)
implies that there are cancellation semantics involved, even if they are not
explicit.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.h:180: // called just to reinitialize the object.
On 2013/12/26 21:45:49, rdsmith wrote:
> Willing to say a bit about what it should be used for?  I wasn't completely
> clear looking at the usage in the total V3 CL what the purpose of this is; is
it
> just a shutdown method?  In which case calling it "Shutdown()" might make more
> sense.

Done.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table_unittest.cc (right):

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:574: // Tests doubling of the table.
On 2013/12/26 21:45:49, rdsmith wrote:
> A couple of cases I'm not certain this test covers that I think would be good
to
> make sure are covered:
> * Doublings that will require moving cells in the table because of the
boundary
> change as to what portion of the hash is used for bucket_id.
> * Cells in the extra table on doubling, which will need to be successfully
> copied into the main table.

Both things are covered by this test.

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:615: // Tests bucket chaining when
growing the index.
On 2013/12/26 21:45:49, rdsmith wrote:
> Why does using the upper half of the table mean we're exercising bucket
> chaining?  Don't we need more than four entries with the same bucket id to
test
> chaining?

Yes, we need more than 4. We have 8 entries in the chain, alternating the bit
that would select entries of the upper half after doubling. So we end up with
two chains of 4 entries each. The second part of the test writes one entry to
each chain, starting with the one on the second half, so the links should be
"inverted" at the end (0 -> second extra, upper half -> first extra). If there
was a problem with the links when doubling, the original chain would be wrongly
chained before receiving the new 5th entry (and could even point to the extra
bucket in use by the other chain, or even the bucket with the other four
entries).

https://codereview.chromium.org/53313004/diff/1950001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table_unittest.cc:706: }
On 2013/12/26 21:45:49, rdsmith wrote:
> Possibly also a test for the logic in CheckState()?

Every lookup goes through that code, so the tests are already covering it.

The only part that is not yet covered is the ENTRY_FIXING, but the backend
doesn't have that code yet so writing a test that calls FixCell seems pointless
so far. BTW, the current tests already verify that FixCell() is never called
because they run with a NULL backend.

Randy Smith (Not in Mondays)

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index_table.cc#newcode25 net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22; How about a comment ...

6 years, 11 months ago (2014-01-06 22:28:34 UTC) #54

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22;
How about a comment before this group of constants indicating that these are
concretizing the description of IndexCell from disk_format_v3.h and need to be
kept in sync with that description?

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF;
Suggestion/nit: I'd personally find these easier to understand if a) the Offset
and Mask were near each other, and b) the masks were put in the form of |(1 <<
n) - 1|.  That'd make it very easy for me to see "shift by these number of bytes
and grab this number of bytes".

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF;
Suggestion/nit: As sorta implied by the above comment, I'd value having a
kCellLocationOffset = 0, even though it isn't needed.  Currently, the
information that the cell location's offset is zero is contained in
GetCellLocation below; creating a constant is extra verbiage, but collects that
information with other information of its ilk.

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:42: const int kHashShift = 14;
Comment documenting that these are to turn a hash into an ID, and so not
reflecting the comments in disk_format_v3.h?

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:210: // that are relinked when growing the
table.
Would you be willing to say something like "When the table is grown, all buckets
are moved from the extra_table to the main table, but the pointers to them from
the main table are not updated.  Those dangling pointers are recognized here
because they point into the main table (i.e. are < min_bucket_num)."  I don't
care about the wording, but I want to be explicit that this is expected in
normal behavior, and is a cleanup from table growth.

As an aside, based on the above, I feel like |bucket_num > max_bucket_num|
suggests table corruption, whereas |bucket_num < min_bucket_num| is expected
cleanup.  Worth separating those out?  Noting the corruption case?

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:489: // kMaxExtraBitsSmallTable =
base::bits::Log2Floor(max 16 bit table) - 10.
Thank you for these comments.

rvargas (doing something else)

Thanks for your time. https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index_table.cc File net/disk_cache/v3/index_table.cc (right): https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index_table.cc#newcode25 net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22; ...

6 years, 11 months ago (2014-01-08 01:29:11 UTC) #55

Thanks for your time.

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
File net/disk_cache/v3/index_table.cc (right):

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:25: const int kCellIdOffset = 22;
On 2014/01/06 22:28:35, rdsmith wrote:
> How about a comment before this group of constants indicating that these are
> concretizing the description of IndexCell from disk_format_v3.h and need to be
> kept in sync with that description?

Done.

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:32: const uint64 kCellLocationMask = 0x3FFFFF;
I tried intermixing the values, but I found that to be very hard to read.
Instead, they are now following the same order of comments of disk_format.h so
it should be trivial to check a mismatch.

As for constants with 0, I can add them if you feel strongly about them, but the
offsets don't specify from where the offset is taken, so it is not like
specifying zeroes define the "value" without looking at the code that uses it.
I'm slightly against adding dead code.

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:42: const int kHashShift = 14;
On 2014/01/06 22:28:35, rdsmith wrote:
> Comment documenting that these are to turn a hash into an ID, and so not
> reflecting the comments in disk_format_v3.h?

Done.

https://codereview.chromium.org/53313004/diff/2050001/net/disk_cache/v3/index...
net/disk_cache/v3/index_table.cc:210: // that are relinked when growing the
table.
On 2014/01/06 22:28:35, rdsmith wrote:
> Would you be willing to say something like "When the table is grown, all
buckets
> are moved from the extra_table to the main table, but the pointers to them
from
> the main table are not updated.  Those dangling pointers are recognized here
> because they point into the main table (i.e. are < min_bucket_num)."  I don't
> care about the wording, but I want to be explicit that this is expected in
> normal behavior, and is a cleanup from table growth.
> 
> As an aside, based on the above, I feel like |bucket_num > max_bucket_num|
> suggests table corruption, whereas |bucket_num < min_bucket_num| is expected
> cleanup.  Worth separating those out?  Noting the corruption case?

Extended the comment. I'm not sure what would be the advantage of separating the
conditions, given that the action is the same. "corruption" is a little loose
term given that a big part of v3 is that there is no corruption... as in the
code will just fix whatever needs to be fixed and ignores whatever cannot be
fixed.

Randy Smith (Not in Mondays)

LGTM. Sorry it's taken so long :-{.

6 years, 11 months ago (2014-01-08 19:10:02 UTC) #56

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/rvargas@chromium.org/53313004/2470001

6 years, 11 months ago (2014-01-09 21:54:34 UTC) #57

Message was sent while issue was closed.

Change committed as 244027

Issue 53313004: Disk cache v3: The main index table. (Closed)

Description

Patch Set 1 : #

Patch Set 2 : #

Patch Set 3 : add comments #

Patch Set 4 : #

Patch Set 5 : Unify UpdateIterator #

Patch Set 6 : WalkTables #

Patch Set 7 : More comments #

Patch Set 8 : remove From*Address use #

Patch Set 9 : #

Patch Set 10 : for(;;) & Co #

Patch Set 11 : #

Patch Set 12 : rename address and hash #

Patch Set 13 : #

Patch Set 14 : Rebase #

Patch Set 15 : #

Messages