Issue 2969123002: Add deduplication logic to .pak files

agrieve

agrieve@chromium.org changed reviewers: + flackr@chromium.org, sadrul@chromium.org

3 years, 5 months ago (2017-07-05 18:07:26 UTC) #1

flackr

I have an alternate proposal to having a bit mean an ID instead of a ...

3 years, 5 months ago (2017-07-05 20:24:03 UTC) #3

agrieve

On 2017/07/05 20:24:03, flackr wrote: > I have an alternate proposal to having a bit ...

3 years, 5 months ago (2017-07-05 20:37:05 UTC) #4

flackr

On 2017/07/05 20:37:05, agrieve wrote: > On 2017/07/05 20:24:03, flackr wrote: > > I have ...

3 years, 5 months ago (2017-07-05 20:41:25 UTC) #5

flackr

On 2017/07/05 20:41:25, flackr wrote: > On 2017/07/05 20:37:05, agrieve wrote: > > On 2017/07/05 ...

3 years, 5 months ago (2017-07-05 20:44:28 UTC) #6

On 2017/07/05 20:41:25, flackr wrote:
> On 2017/07/05 20:37:05, agrieve wrote:
> > On 2017/07/05 20:24:03, flackr wrote:
> > > I have an alternate proposal to having a bit mean an ID instead of a
> location.
> > > What if instead we allowed repeated id's and then we could just point to
the
> > > location of the earlier resource in the file?
> > > 
> > > e.g. If ID3 uses the same resource as ID1:
> > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF]
> > > 
> > > Then we just have to make sure that the first occurrence of an id is used
> but
> > > otherwise the file format doesn't have special cases.
> > > 
> > > This requires a bit of additional space as you have to repeat header
entries
> > for
> > > the back reference and for the previous resource (so it can compute its
> > length).
> > 
> > Interesting idea! This would still require a version bump though wouldn't
it?
> > The fact that the first ID must be used is a breaking change, as right now
the
> > bsearch() doesn't guarantee that. If we're going to bump the version code
> > anyways, I think I'd prefer not to add any bytes that are not necessary
> (unless
> > it would improve performance, but I don't think this does)
> 
> It has several perf benefits:
> It eliminates the second binary search when you discover you have a reference.
> Perf and reduced code complexity in not having to loop past reference ids to
> find the next non-reference one.
> 
> The cost is only an extra 6 bytes per reference. Technically you could avoid
it
> by detecting back references and skipping over them but that adds a lot of the
> same code complexity and some additional generation complexity because you
have
> to detect when you can't correctly detect that.

Oops, sent too early, 12 bytes for random back references, but only 6 additional
bytes if you have repeated back references.

But yes, I think we would have to bump the revision still since it's not
backwards compatible.

flackr

On 2017/07/05 20:44:28, flackr wrote: > On 2017/07/05 20:41:25, flackr wrote: > > On 2017/07/05 ...

3 years, 5 months ago (2017-07-05 20:51:25 UTC) #7

On 2017/07/05 20:44:28, flackr wrote:
> On 2017/07/05 20:41:25, flackr wrote:
> > On 2017/07/05 20:37:05, agrieve wrote:
> > > On 2017/07/05 20:24:03, flackr wrote:
> > > > I have an alternate proposal to having a bit mean an ID instead of a
> > location.
> > > > What if instead we allowed repeated id's and then we could just point to
> the
> > > > location of the earlier resource in the file?
> > > > 
> > > > e.g. If ID3 uses the same resource as ID1:
> > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF]
> > > > 
> > > > Then we just have to make sure that the first occurrence of an id is
used
> > but
> > > > otherwise the file format doesn't have special cases.
> > > > 
> > > > This requires a bit of additional space as you have to repeat header
> entries
> > > for
> > > > the back reference and for the previous resource (so it can compute its
> > > length).
> > > 
> > > Interesting idea! This would still require a version bump though wouldn't
> it?
> > > The fact that the first ID must be used is a breaking change, as right now
> the
> > > bsearch() doesn't guarantee that. If we're going to bump the version code
> > > anyways, I think I'd prefer not to add any bytes that are not necessary
> > (unless
> > > it would improve performance, but I don't think this does)
> > 
> > It has several perf benefits:
> > It eliminates the second binary search when you discover you have a
reference.
> > Perf and reduced code complexity in not having to loop past reference ids to
> > find the next non-reference one.
> > 
> > The cost is only an extra 6 bytes per reference. Technically you could avoid
> it
> > by detecting back references and skipping over them but that adds a lot of
the
> > same code complexity and some additional generation complexity because you
> have
> > to detect when you can't correctly detect that.
> 
> Oops, sent too early, 12 bytes for random back references, but only 6
additional
> bytes if you have repeated back references.
> 
> But yes, I think we would have to bump the revision still since it's not
> backwards compatible.

Alternately, if we assume the number of references to other id's is low, we save
more table of contents space by having a second (id -> id) map (only 4 bytes per
entry) at the cost of an additional TOC lookup.

agrieve

On 2017/07/05 20:51:25, flackr wrote: > On 2017/07/05 20:44:28, flackr wrote: > > On 2017/07/05 ...

3 years, 5 months ago (2017-07-06 03:50:04 UTC) #8

On 2017/07/05 20:51:25, flackr wrote:
> On 2017/07/05 20:44:28, flackr wrote:
> > On 2017/07/05 20:41:25, flackr wrote:
> > > On 2017/07/05 20:37:05, agrieve wrote:
> > > > On 2017/07/05 20:24:03, flackr wrote:
> > > > > I have an alternate proposal to having a bit mean an ID instead of a
> > > location.
> > > > > What if instead we allowed repeated id's and then we could just point
to
> > the
> > > > > location of the earlier resource in the file?
> > > > > 
> > > > > e.g. If ID3 uses the same resource as ID1:
> > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF]
> > > > > 
> > > > > Then we just have to make sure that the first occurrence of an id is
> used
> > > but
> > > > > otherwise the file format doesn't have special cases.
> > > > > 
> > > > > This requires a bit of additional space as you have to repeat header
> > entries
> > > > for
> > > > > the back reference and for the previous resource (so it can compute
its
> > > > length).
> > > > 
> > > > Interesting idea! This would still require a version bump though
wouldn't
> > it?
> > > > The fact that the first ID must be used is a breaking change, as right
now
> > the
> > > > bsearch() doesn't guarantee that. If we're going to bump the version
code
> > > > anyways, I think I'd prefer not to add any bytes that are not necessary
> > > (unless
> > > > it would improve performance, but I don't think this does)
> > > 
> > > It has several perf benefits:
> > > It eliminates the second binary search when you discover you have a
> reference.
> > > Perf and reduced code complexity in not having to loop past reference ids
to
> > > find the next non-reference one.
> > > 
> > > The cost is only an extra 6 bytes per reference. Technically you could
avoid
> > it
> > > by detecting back references and skipping over them but that adds a lot of
> the
> > > same code complexity and some additional generation complexity because you
> > have
> > > to detect when you can't correctly detect that.
> > 
> > Oops, sent too early, 12 bytes for random back references, but only 6
> additional
> > bytes if you have repeated back references.
> > 
> > But yes, I think we would have to bump the revision still since it's not
> > backwards compatible.
> 
> Alternately, if we assume the number of references to other id's is low, we
save
> more table of contents space by having a second (id -> id) map (only 4 bytes
per
> entry) at the cost of an additional TOC lookup.

I think it's safe to assume the number of references is very low.

Note: there is no need for a 2nd binary search. It doesn't encode the aliased
id, but rather the index into the table of the alias.

One nice thing about encoding the aliased entry is that you could report which
IDs are aliases without having to create a reverse map of offset->id. Not sure
it's terribly useful, but could maybe add it to pak_util.py's dump output. Come
to think of it, I think I will do this :)

I did consider adding a 2nd table for aliases (option 1 in the linked bug), but
figured the size savings wouldn't be worth having to have the pak reading logic
have "if (version == 5) blocks in it". It wouldn't be much of a code size
difference though, so if you feel strongly that it's a better solution I'll
switch to that.

flackr

On 2017/07/06 03:50:04, agrieve wrote: > On 2017/07/05 20:51:25, flackr wrote: > > On 2017/07/05 ...

3 years, 5 months ago (2017-07-06 13:35:19 UTC) #9

On 2017/07/06 03:50:04, agrieve wrote:
> On 2017/07/05 20:51:25, flackr wrote:
> > On 2017/07/05 20:44:28, flackr wrote:
> > > On 2017/07/05 20:41:25, flackr wrote:
> > > > On 2017/07/05 20:37:05, agrieve wrote:
> > > > > On 2017/07/05 20:24:03, flackr wrote:
> > > > > > I have an alternate proposal to having a bit mean an ID instead of a
> > > > location.
> > > > > > What if instead we allowed repeated id's and then we could just
point
> to
> > > the
> > > > > > location of the earlier resource in the file?
> > > > > > 
> > > > > > e.g. If ID3 uses the same resource as ID1:
> > > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF]
> > > > > > 
> > > > > > Then we just have to make sure that the first occurrence of an id is
> > used
> > > > but
> > > > > > otherwise the file format doesn't have special cases.
> > > > > > 
> > > > > > This requires a bit of additional space as you have to repeat header
> > > entries
> > > > > for
> > > > > > the back reference and for the previous resource (so it can compute
> its
> > > > > length).
> > > > > 
> > > > > Interesting idea! This would still require a version bump though
> wouldn't
> > > it?
> > > > > The fact that the first ID must be used is a breaking change, as right
> now
> > > the
> > > > > bsearch() doesn't guarantee that. If we're going to bump the version
> code
> > > > > anyways, I think I'd prefer not to add any bytes that are not
necessary
> > > > (unless
> > > > > it would improve performance, but I don't think this does)
> > > > 
> > > > It has several perf benefits:
> > > > It eliminates the second binary search when you discover you have a
> > reference.
> > > > Perf and reduced code complexity in not having to loop past reference
ids
> to
> > > > find the next non-reference one.
> > > > 
> > > > The cost is only an extra 6 bytes per reference. Technically you could
> avoid
> > > it
> > > > by detecting back references and skipping over them but that adds a lot
of
> > the
> > > > same code complexity and some additional generation complexity because
you
> > > have
> > > > to detect when you can't correctly detect that.
> > > 
> > > Oops, sent too early, 12 bytes for random back references, but only 6
> > additional
> > > bytes if you have repeated back references.
> > > 
> > > But yes, I think we would have to bump the revision still since it's not
> > > backwards compatible.
> > 
> > Alternately, if we assume the number of references to other id's is low, we
> save
> > more table of contents space by having a second (id -> id) map (only 4 bytes
> per
> > entry) at the cost of an additional TOC lookup.
> 
> I think it's safe to assume the number of references is very low.
> 
> Note: there is no need for a 2nd binary search. It doesn't encode the aliased
> id, but rather the index into the table of the alias.

Oh I see, what about storing the location to the table of contents entry? Then
at least we're storing a uniform type of value (no need for an extra bit) and if
it's less than the end of the table of contents you know to treat it
differently?

> 
> One nice thing about encoding the aliased entry is that you could report which
> IDs are aliases without having to create a reverse map of offset->id. Not sure
> it's terribly useful, but could maybe add it to pak_util.py's dump output.
Come
> to think of it, I think I will do this :)
> 
> I did consider adding a 2nd table for aliases (option 1 in the linked bug),
but
> figured the size savings wouldn't be worth having to have the pak reading
logic
> have "if (version == 5) blocks in it". It wouldn't be much of a code size
> difference though, so if you feel strongly that it's a better solution I'll
> switch to that.

agrieve

Description was changed from ========== Add deduplication logic to .pak files Now, when multiple entries ...

3 years, 5 months ago (2017-07-06 17:56:29 UTC) #10

agrieve

On 2017/07/06 13:35:19, flackr wrote: > On 2017/07/06 03:50:04, agrieve wrote: > > On 2017/07/05 ...

3 years, 5 months ago (2017-07-06 17:56:56 UTC) #11

On 2017/07/06 13:35:19, flackr wrote:
> On 2017/07/06 03:50:04, agrieve wrote:
> > On 2017/07/05 20:51:25, flackr wrote:
> > > On 2017/07/05 20:44:28, flackr wrote:
> > > > On 2017/07/05 20:41:25, flackr wrote:
> > > > > On 2017/07/05 20:37:05, agrieve wrote:
> > > > > > On 2017/07/05 20:24:03, flackr wrote:
> > > > > > > I have an alternate proposal to having a bit mean an ID instead of
a
> > > > > location.
> > > > > > > What if instead we allowed repeated id's and then we could just
> point
> > to
> > > > the
> > > > > > > location of the earlier resource in the file?
> > > > > > > 
> > > > > > > e.g. If ID3 uses the same resource as ID1:
> > > > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF]
> > > > > > > 
> > > > > > > Then we just have to make sure that the first occurrence of an id
is
> > > used
> > > > > but
> > > > > > > otherwise the file format doesn't have special cases.
> > > > > > > 
> > > > > > > This requires a bit of additional space as you have to repeat
header
> > > > entries
> > > > > > for
> > > > > > > the back reference and for the previous resource (so it can
compute
> > its
> > > > > > length).
> > > > > > 
> > > > > > Interesting idea! This would still require a version bump though
> > wouldn't
> > > > it?
> > > > > > The fact that the first ID must be used is a breaking change, as
right
> > now
> > > > the
> > > > > > bsearch() doesn't guarantee that. If we're going to bump the version
> > code
> > > > > > anyways, I think I'd prefer not to add any bytes that are not
> necessary
> > > > > (unless
> > > > > > it would improve performance, but I don't think this does)
> > > > > 
> > > > > It has several perf benefits:
> > > > > It eliminates the second binary search when you discover you have a
> > > reference.
> > > > > Perf and reduced code complexity in not having to loop past reference
> ids
> > to
> > > > > find the next non-reference one.
> > > > > 
> > > > > The cost is only an extra 6 bytes per reference. Technically you could
> > avoid
> > > > it
> > > > > by detecting back references and skipping over them but that adds a
lot
> of
> > > the
> > > > > same code complexity and some additional generation complexity because
> you
> > > > have
> > > > > to detect when you can't correctly detect that.
> > > > 
> > > > Oops, sent too early, 12 bytes for random back references, but only 6
> > > additional
> > > > bytes if you have repeated back references.
> > > > 
> > > > But yes, I think we would have to bump the revision still since it's not
> > > > backwards compatible.
> > > 
> > > Alternately, if we assume the number of references to other id's is low,
we
> > save
> > > more table of contents space by having a second (id -> id) map (only 4
bytes
> > per
> > > entry) at the cost of an additional TOC lookup.
> > 
> > I think it's safe to assume the number of references is very low.
> > 
> > Note: there is no need for a 2nd binary search. It doesn't encode the
aliased
> > id, but rather the index into the table of the alias.
> 
> Oh I see, what about storing the location to the table of contents entry? Then
> at least we're storing a uniform type of value (no need for an extra bit) and
if
> it's less than the end of the table of contents you know to treat it
> differently?
> 
> > 
> > One nice thing about encoding the aliased entry is that you could report
which
> > IDs are aliases without having to create a reverse map of offset->id. Not
sure
> > it's terribly useful, but could maybe add it to pak_util.py's dump output.
> Come
> > to think of it, I think I will do this :)
> > 
> > I did consider adding a 2nd table for aliases (option 1 in the linked bug),
> but
> > figured the size savings wouldn't be worth having to have the pak reading
> logic
> > have "if (version == 5) blocks in it". It wouldn't be much of a code size
> > difference though, so if you feel strongly that it's a better solution I'll
> > switch to that.

Switched over to using an alias map. ptal :)

agrieve

The CQ bit was checked by agrieve@chromium.org to run a CQ dry run

3 years, 5 months ago (2017-07-06 19:02:18 UTC) #12

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2969123002/100001

3 years, 5 months ago (2017-07-06 19:02:35 UTC) #13

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 5 months ago (2017-07-06 20:31:46 UTC) #14

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/422361)

3 years, 5 months ago (2017-07-06 20:31:47 UTC) #15

flackr

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py#newcode80 tools/grit/grit/format/data_pack.py:80: prev_offset = entry_at_index(resource_count)[1] It's a little strange to not ...

3 years, 5 months ago (2017-07-07 18:54:13 UTC) #16

agrieve

One comment to resolve still, but heading off on vacation. Sadrul - ptal. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py File ...

3 years, 5 months ago (2017-07-07 20:47:09 UTC) #17

One comment to resolve still, but heading off on vacation. Sadrul - ptal.

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
File tools/grit/grit/format/data_pack.py (right):

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack.py:80: prev_offset =
entry_at_index(resource_count)[1]
On 2017/07/07 18:54:12, flackr wrote:
> It's a little strange to not read in a forwards direction. I think it might be
> cleaner to store cur_entry and for each next_entry (i.e. 1 to count) add the
> cur_entry.

ah, yeah, that's nicer :)

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack.py:107: id_by_data = {resources[k]: k for k in
reversed(resource_ids)}
On 2017/07/07 18:54:12, flackr wrote:
> Comment why we use reversed - presumably want lowest k which resource maps to.

Done.

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0
since version and encoding < 256.
On 2017/07/07 18:54:12, flackr wrote:
> Version I could see may someday be > 256, I can just imagine the day when
we're
> sad to have to make a breaking change because we can't create grit files with
> version 65536. But why grow encoding to a short?

Expanded the comment. I just speculated that things may go faster if the tables
are not aligned on odd-numbered bytes.

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack.py:119: # Each main table entry is a uint16 + a
uint32.
On 2017/07/07 18:54:12, flackr wrote:
> Preserve the comment that we have an extra entry for the last item.

Done.

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
File tools/grit/grit/format/data_pack_unittest.py (right):

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack_unittest.py:45: '\x03\x00\x01\x00'             
    # resource_count, alias_count,
On 2017/07/07 18:54:12, flackr wrote:
> nit: I think the test will read more cleanly with one value per line than
> putting both (version, encoding) and (resource_count, alias_count) into single
> lines.

Done.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
File ui/base/resource/data_pack.cc (right):

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:30: static const size_t kHeaderLengthV5 = 4 *
sizeof(uint16_t);
On 2017/07/07 18:54:12, flackr wrote:
> This seems to be masquerading the actual data it refers to, i.e. the encoding
> type is still a byte is it not? It's nice to have the comment explaining where
> the length comes from.

I think it's actually simpler to just say encoding is 2 bytes. Added comment.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:207: if (data_source->GetLength() <
kHeaderLengthV4) {
On 2017/07/07 18:54:12, flackr wrote:
> This means we don't get HEADER_TRUNCATED for an 8 byte v5 header. How about
> moving the full header check to after we know the version and do an early
check
> that at least the header version is readable?

I think the main point of this check is just to ensure we don't read past eof. I
don't think it's worth adding precious code size for an 8 byte v5 header as long
as this still results in an error and we don't read past eof (which I believe is
the case)

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:215: int version = data[0];
On 2017/07/07 18:54:12, flackr wrote:
> I think we should keep the version length the same so that we don't have code
> which assumes the header version is the first byte. Do you think the extra two
> bytes is a concern?

I'm not sure what your concern is here. The code assumes endianness all over.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:262: // entry after the last item which gives us
the length of the last item.
On 2017/07/07 18:54:12, flackr wrote:
> Also verify the indices of the alias table are within the resource_count.

Done.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:277: const Entry* ret = reinterpret_cast<const
Entry*>(
On 2017/07/07 18:54:12, flackr wrote:
> nit: Add a comment explaining that we search the resource table first as most
> resources will be in there, and then the alias table for those which were
> duplicates.

Done.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:377: // TODO(agrieve): Is there any benefit to
writing a v5 file (with aliases)?
On 2017/07/07 18:54:12, flackr wrote:
> Oh this is interesting. Maybe we should change this to write a v5 file with an
> empty aliases table for now with a TODO. Maybe that'd mean that someday we
won't
> have to support v4?

Sounds good. Just heading out now, but will address later.

flackr

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py#newcode114 tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 ...

3 years, 5 months ago (2017-07-10 14:07:50 UTC) #18

agrieve

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format/data_pack.py#newcode114 tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 ...

3 years, 5 months ago (2017-07-18 19:30:29 UTC) #19

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
File tools/grit/grit/format/data_pack.py (right):

https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format...
tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0
since version and encoding < 256.
On 2017/07/10 14:07:49, flackr wrote:
> On 2017/07/07 20:47:08, agrieve wrote:
> > On 2017/07/07 18:54:12, flackr wrote:
> > > Version I could see may someday be > 256, I can just imagine the day when
> > we're
> > > sad to have to make a breaking change because we can't create grit files
> with
> > > version 65536. But why grow encoding to a short?
> > 
> > Expanded the comment. I just speculated that things may go faster if the
> tables
> > are not aligned on odd-numbered bytes.
> 
> I suppose that could help, but it's better IMO to insert an intentional gap so
> that we know there is a gap there later if/when we need to add something else.

Done.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
File ui/base/resource/data_pack.cc (right):

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:215: int version = data[0];
On 2017/07/10 14:07:49, flackr wrote:
> On 2017/07/07 20:47:08, agrieve wrote:
> > On 2017/07/07 18:54:12, flackr wrote:
> > > I think we should keep the version length the same so that we don't have
> code
> > > which assumes the header version is the first byte. Do you think the extra
> two
> > > bytes is a concern?
> > 
> > I'm not sure what your concern is here. The code assumes endianness all
over.
> 
> I'm worried it's bad practice to shorten the version number field on a file
> format, I don't think saving a couple bytes justifies doing it. For example,
> this code will now read version 261 as the same as 5, and old code will read
> version 5 files as the sum of 5 and the encoding * 2^16.

Yeah, I suppose a few bytes per-file isn't worth changing.

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_...
ui/base/resource/data_pack.cc:377: // TODO(agrieve): Is there any benefit to
writing a v5 file (with aliases)?
On 2017/07/07 20:47:09, agrieve wrote:
> On 2017/07/07 18:54:12, flackr wrote:
> > Oh this is interesting. Maybe we should change this to write a v5 file with
an
> > empty aliases table for now with a TODO. Maybe that'd mean that someday we
> won't
> > have to support v4?
> 
> Sounds good. Just heading out now, but will address later.

Done.

flackr

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_pack.cc File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_pack.cc#newcode207 ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { On 2017/07/07 20:47:09, agrieve ...

3 years, 5 months ago (2017-07-18 20:21:23 UTC) #20

agrieve

https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_pack.cc File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_pack.cc#newcode207 ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { On 2017/07/18 20:21:23, flackr ...

3 years, 5 months ago (2017-07-19 02:34:05 UTC) #21

flackr

LGTM with nit, thanks for bearing with me on the concerns. https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_pack.cc File ui/base/resource/data_pack.cc (right): ...

3 years, 5 months ago (2017-07-19 03:05:04 UTC) #22

agrieve

And thanks for putting thought into the reviews! Sadrul - ptal. https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_pack.cc File ui/base/resource/data_pack.cc (right): ...

3 years, 5 months ago (2017-07-19 14:59:01 UTC) #23

agrieve

The CQ bit was checked by agrieve@chromium.org to run a CQ dry run

3 years, 5 months ago (2017-07-19 14:59:19 UTC) #24

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2969123002/180001

3 years, 5 months ago (2017-07-19 14:59:24 UTC) #25

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 5 months ago (2017-07-19 15:54:36 UTC) #26

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/261856)

3 years, 5 months ago (2017-07-19 15:54:42 UTC) #27

agrieve

The CQ bit was checked by agrieve@chromium.org to run a CQ dry run

3 years, 5 months ago (2017-07-19 17:46:29 UTC) #28

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2969123002/180001

3 years, 5 months ago (2017-07-19 17:46:35 UTC) #29

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 5 months ago (2017-07-19 18:17:34 UTC) #30

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 5 months ago (2017-07-19 18:17:35 UTC) #31

agrieve

The patchset sent to the CQ was uploaded after l-g-t-m from flackr@chromium.org Link to the ...

3 years, 5 months ago (2017-07-20 12:42:21 UTC) #34

commit-bot: I haz the power

CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2969123002/180001

3 years, 5 months ago (2017-07-20 12:42:28 UTC) #35

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 180001, "attempt_start_ts": 1500554540567170, "parent_rev": "a83609260044160cb960f95b6f12346161e2f248", "commit_rev": "ade347f539a378fb37500f6d17e8835edc1d8ec0"}

3 years, 5 months ago (2017-07-20 12:48:14 UTC) #36

commit-bot: I haz the power

Description was changed from ========== Add deduplication logic to .pak files Now, when multiple entries ...

3 years, 5 months ago (2017-07-20 12:48:32 UTC) #37

commit-bot: I haz the power

Committed patchset #10 (id:180001) as https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8835edc1d8ec0

3 years, 5 months ago (2017-07-20 12:48:33 UTC) #38

achuithb

On 2017/07/20 12:48:33, commit-bot: I haz the power wrote: > Committed patchset #10 (id:180001) as ...

3 years, 5 months ago (2017-07-20 23:53:20 UTC) #39

achuithb

3 years, 5 months ago (2017-07-20 23:55:40 UTC) #40

Message was sent while issue was closed.

A revert of this CL (patchset #10 id:180001) has been created in
https://codereview.chromium.org/2989443002/ by achuith@chromium.org.

The reason for reverting is:
https://bugs.chromium.org/p/chromium/issues/detail?id=747171.

Issue 2969123002: Add deduplication logic to .pak files (Closed)

Description

Patch Set 1 #

Patch Set 2 : Add deduplication logic to .pak files #

Patch Set 3 : self review #

Patch Set 4 : Convert to alias table #

Patch Set 5 : update pak_util.py #

Patch Set 6 : fix resource_sizes computation #

Patch Set 7 : review comments #

Patch Set 8 : Bigger header, write a v5 #

Patch Set 9 : review comments #

Patch Set 10 : sizeof() #

Messages