|
|
DescriptionAdd deduplication logic to .pak files
Now, when multiple entries contain the same content, multiple
table-of-contents entries will be created, but only one data region.
As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates.
For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and
compressed .pak size by 32kb.
BUG=738566
Review-Url: https://codereview.chromium.org/2969123002
Cr-Commit-Position: refs/heads/master@{#488215}
Committed: https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8835edc1d8ec0
Patch Set 1 #Patch Set 2 : Add deduplication logic to .pak files #Patch Set 3 : self review #Patch Set 4 : Convert to alias table #Patch Set 5 : update pak_util.py #Patch Set 6 : fix resource_sizes computation #
Total comments: 31
Patch Set 7 : review comments #Patch Set 8 : Bigger header, write a v5 #
Total comments: 2
Patch Set 9 : review comments #
Total comments: 2
Patch Set 10 : sizeof() #
Messages
Total messages: 40 (18 generated)
agrieve@chromium.org changed reviewers: + flackr@chromium.org, sadrul@chromium.org
📟
I have an alternate proposal to having a bit mean an ID instead of a location. What if instead we allowed repeated id's and then we could just point to the location of the earlier resource in the file? e.g. If ID3 uses the same resource as ID1: [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] Then we just have to make sure that the first occurrence of an id is used but otherwise the file format doesn't have special cases. This requires a bit of additional space as you have to repeat header entries for the back reference and for the previous resource (so it can compute its length).
On 2017/07/05 20:24:03, flackr wrote: > I have an alternate proposal to having a bit mean an ID instead of a location. > What if instead we allowed repeated id's and then we could just point to the > location of the earlier resource in the file? > > e.g. If ID3 uses the same resource as ID1: > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > Then we just have to make sure that the first occurrence of an id is used but > otherwise the file format doesn't have special cases. > > This requires a bit of additional space as you have to repeat header entries for > the back reference and for the previous resource (so it can compute its length). Interesting idea! This would still require a version bump though wouldn't it? The fact that the first ID must be used is a breaking change, as right now the bsearch() doesn't guarantee that. If we're going to bump the version code anyways, I think I'd prefer not to add any bytes that are not necessary (unless it would improve performance, but I don't think this does)
On 2017/07/05 20:37:05, agrieve wrote: > On 2017/07/05 20:24:03, flackr wrote: > > I have an alternate proposal to having a bit mean an ID instead of a location. > > What if instead we allowed repeated id's and then we could just point to the > > location of the earlier resource in the file? > > > > e.g. If ID3 uses the same resource as ID1: > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > Then we just have to make sure that the first occurrence of an id is used but > > otherwise the file format doesn't have special cases. > > > > This requires a bit of additional space as you have to repeat header entries > for > > the back reference and for the previous resource (so it can compute its > length). > > Interesting idea! This would still require a version bump though wouldn't it? > The fact that the first ID must be used is a breaking change, as right now the > bsearch() doesn't guarantee that. If we're going to bump the version code > anyways, I think I'd prefer not to add any bytes that are not necessary (unless > it would improve performance, but I don't think this does) It has several perf benefits: It eliminates the second binary search when you discover you have a reference. Perf and reduced code complexity in not having to loop past reference ids to find the next non-reference one. The cost is only an extra 6 bytes per reference. Technically you could avoid it by detecting back references and skipping over them but that adds a lot of the same code complexity and some additional generation complexity because you have to detect when you can't correctly detect that.
On 2017/07/05 20:41:25, flackr wrote: > On 2017/07/05 20:37:05, agrieve wrote: > > On 2017/07/05 20:24:03, flackr wrote: > > > I have an alternate proposal to having a bit mean an ID instead of a > location. > > > What if instead we allowed repeated id's and then we could just point to the > > > location of the earlier resource in the file? > > > > > > e.g. If ID3 uses the same resource as ID1: > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > > > Then we just have to make sure that the first occurrence of an id is used > but > > > otherwise the file format doesn't have special cases. > > > > > > This requires a bit of additional space as you have to repeat header entries > > for > > > the back reference and for the previous resource (so it can compute its > > length). > > > > Interesting idea! This would still require a version bump though wouldn't it? > > The fact that the first ID must be used is a breaking change, as right now the > > bsearch() doesn't guarantee that. If we're going to bump the version code > > anyways, I think I'd prefer not to add any bytes that are not necessary > (unless > > it would improve performance, but I don't think this does) > > It has several perf benefits: > It eliminates the second binary search when you discover you have a reference. > Perf and reduced code complexity in not having to loop past reference ids to > find the next non-reference one. > > The cost is only an extra 6 bytes per reference. Technically you could avoid it > by detecting back references and skipping over them but that adds a lot of the > same code complexity and some additional generation complexity because you have > to detect when you can't correctly detect that. Oops, sent too early, 12 bytes for random back references, but only 6 additional bytes if you have repeated back references. But yes, I think we would have to bump the revision still since it's not backwards compatible.
On 2017/07/05 20:44:28, flackr wrote: > On 2017/07/05 20:41:25, flackr wrote: > > On 2017/07/05 20:37:05, agrieve wrote: > > > On 2017/07/05 20:24:03, flackr wrote: > > > > I have an alternate proposal to having a bit mean an ID instead of a > > location. > > > > What if instead we allowed repeated id's and then we could just point to > the > > > > location of the earlier resource in the file? > > > > > > > > e.g. If ID3 uses the same resource as ID1: > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > > > > > Then we just have to make sure that the first occurrence of an id is used > > but > > > > otherwise the file format doesn't have special cases. > > > > > > > > This requires a bit of additional space as you have to repeat header > entries > > > for > > > > the back reference and for the previous resource (so it can compute its > > > length). > > > > > > Interesting idea! This would still require a version bump though wouldn't > it? > > > The fact that the first ID must be used is a breaking change, as right now > the > > > bsearch() doesn't guarantee that. If we're going to bump the version code > > > anyways, I think I'd prefer not to add any bytes that are not necessary > > (unless > > > it would improve performance, but I don't think this does) > > > > It has several perf benefits: > > It eliminates the second binary search when you discover you have a reference. > > Perf and reduced code complexity in not having to loop past reference ids to > > find the next non-reference one. > > > > The cost is only an extra 6 bytes per reference. Technically you could avoid > it > > by detecting back references and skipping over them but that adds a lot of the > > same code complexity and some additional generation complexity because you > have > > to detect when you can't correctly detect that. > > Oops, sent too early, 12 bytes for random back references, but only 6 additional > bytes if you have repeated back references. > > But yes, I think we would have to bump the revision still since it's not > backwards compatible. Alternately, if we assume the number of references to other id's is low, we save more table of contents space by having a second (id -> id) map (only 4 bytes per entry) at the cost of an additional TOC lookup.
On 2017/07/05 20:51:25, flackr wrote: > On 2017/07/05 20:44:28, flackr wrote: > > On 2017/07/05 20:41:25, flackr wrote: > > > On 2017/07/05 20:37:05, agrieve wrote: > > > > On 2017/07/05 20:24:03, flackr wrote: > > > > > I have an alternate proposal to having a bit mean an ID instead of a > > > location. > > > > > What if instead we allowed repeated id's and then we could just point to > > the > > > > > location of the earlier resource in the file? > > > > > > > > > > e.g. If ID3 uses the same resource as ID1: > > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > > > > > > > Then we just have to make sure that the first occurrence of an id is > used > > > but > > > > > otherwise the file format doesn't have special cases. > > > > > > > > > > This requires a bit of additional space as you have to repeat header > > entries > > > > for > > > > > the back reference and for the previous resource (so it can compute its > > > > length). > > > > > > > > Interesting idea! This would still require a version bump though wouldn't > > it? > > > > The fact that the first ID must be used is a breaking change, as right now > > the > > > > bsearch() doesn't guarantee that. If we're going to bump the version code > > > > anyways, I think I'd prefer not to add any bytes that are not necessary > > > (unless > > > > it would improve performance, but I don't think this does) > > > > > > It has several perf benefits: > > > It eliminates the second binary search when you discover you have a > reference. > > > Perf and reduced code complexity in not having to loop past reference ids to > > > find the next non-reference one. > > > > > > The cost is only an extra 6 bytes per reference. Technically you could avoid > > it > > > by detecting back references and skipping over them but that adds a lot of > the > > > same code complexity and some additional generation complexity because you > > have > > > to detect when you can't correctly detect that. > > > > Oops, sent too early, 12 bytes for random back references, but only 6 > additional > > bytes if you have repeated back references. > > > > But yes, I think we would have to bump the revision still since it's not > > backwards compatible. > > Alternately, if we assume the number of references to other id's is low, we save > more table of contents space by having a second (id -> id) map (only 4 bytes per > entry) at the cost of an additional TOC lookup. I think it's safe to assume the number of references is very low. Note: there is no need for a 2nd binary search. It doesn't encode the aliased id, but rather the index into the table of the alias. One nice thing about encoding the aliased entry is that you could report which IDs are aliases without having to create a reverse map of offset->id. Not sure it's terribly useful, but could maybe add it to pak_util.py's dump output. Come to think of it, I think I will do this :) I did consider adding a 2nd table for aliases (option 1 in the linked bug), but figured the size savings wouldn't be worth having to have the pak reading logic have "if (version == 5) blocks in it". It wouldn't be much of a code size difference though, so if you feel strongly that it's a better solution I'll switch to that.
On 2017/07/06 03:50:04, agrieve wrote: > On 2017/07/05 20:51:25, flackr wrote: > > On 2017/07/05 20:44:28, flackr wrote: > > > On 2017/07/05 20:41:25, flackr wrote: > > > > On 2017/07/05 20:37:05, agrieve wrote: > > > > > On 2017/07/05 20:24:03, flackr wrote: > > > > > > I have an alternate proposal to having a bit mean an ID instead of a > > > > location. > > > > > > What if instead we allowed repeated id's and then we could just point > to > > > the > > > > > > location of the earlier resource in the file? > > > > > > > > > > > > e.g. If ID3 uses the same resource as ID1: > > > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > > > > > > > > > Then we just have to make sure that the first occurrence of an id is > > used > > > > but > > > > > > otherwise the file format doesn't have special cases. > > > > > > > > > > > > This requires a bit of additional space as you have to repeat header > > > entries > > > > > for > > > > > > the back reference and for the previous resource (so it can compute > its > > > > > length). > > > > > > > > > > Interesting idea! This would still require a version bump though > wouldn't > > > it? > > > > > The fact that the first ID must be used is a breaking change, as right > now > > > the > > > > > bsearch() doesn't guarantee that. If we're going to bump the version > code > > > > > anyways, I think I'd prefer not to add any bytes that are not necessary > > > > (unless > > > > > it would improve performance, but I don't think this does) > > > > > > > > It has several perf benefits: > > > > It eliminates the second binary search when you discover you have a > > reference. > > > > Perf and reduced code complexity in not having to loop past reference ids > to > > > > find the next non-reference one. > > > > > > > > The cost is only an extra 6 bytes per reference. Technically you could > avoid > > > it > > > > by detecting back references and skipping over them but that adds a lot of > > the > > > > same code complexity and some additional generation complexity because you > > > have > > > > to detect when you can't correctly detect that. > > > > > > Oops, sent too early, 12 bytes for random back references, but only 6 > > additional > > > bytes if you have repeated back references. > > > > > > But yes, I think we would have to bump the revision still since it's not > > > backwards compatible. > > > > Alternately, if we assume the number of references to other id's is low, we > save > > more table of contents space by having a second (id -> id) map (only 4 bytes > per > > entry) at the cost of an additional TOC lookup. > > I think it's safe to assume the number of references is very low. > > Note: there is no need for a 2nd binary search. It doesn't encode the aliased > id, but rather the index into the table of the alias. Oh I see, what about storing the location to the table of contents entry? Then at least we're storing a uniform type of value (no need for an extra bit) and if it's less than the end of the table of contents you know to treat it differently? > > One nice thing about encoding the aliased entry is that you could report which > IDs are aliases without having to create a reverse map of offset->id. Not sure > it's terribly useful, but could maybe add it to pak_util.py's dump output. Come > to think of it, I think I will do this :) > > I did consider adding a 2nd table for aliases (option 1 in the linked bug), but > figured the size savings wouldn't be worth having to have the pak reading logic > have "if (version == 5) blocks in it". It wouldn't be much of a code size > difference though, so if you feel strongly that it's a better solution I'll > switch to that.
Description was changed from ========== Add deduplication logic to .pak files Now, when multiple entries contain the same content, multiple table-of-contents entries will be created, but only one data region. For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and compressed .pak size by 32kb. BUG=738566 ========== to ========== Add deduplication logic to .pak files Now, when multiple entries contain the same content, multiple table-of-contents entries will be created, but only one data region. As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates. For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and compressed .pak size by 32kb. BUG=738566 ==========
On 2017/07/06 13:35:19, flackr wrote: > On 2017/07/06 03:50:04, agrieve wrote: > > On 2017/07/05 20:51:25, flackr wrote: > > > On 2017/07/05 20:44:28, flackr wrote: > > > > On 2017/07/05 20:41:25, flackr wrote: > > > > > On 2017/07/05 20:37:05, agrieve wrote: > > > > > > On 2017/07/05 20:24:03, flackr wrote: > > > > > > > I have an alternate proposal to having a bit mean an ID instead of a > > > > > location. > > > > > > > What if instead we allowed repeated id's and then we could just > point > > to > > > > the > > > > > > > location of the earlier resource in the file? > > > > > > > > > > > > > > e.g. If ID3 uses the same resource as ID1: > > > > > > > [ID1, L1, ID2, L2, ID2, L4, ID3, L1, ID3, L2, ID4, L4, 0, EOF] > > > > > > > > > > > > > > Then we just have to make sure that the first occurrence of an id is > > > used > > > > > but > > > > > > > otherwise the file format doesn't have special cases. > > > > > > > > > > > > > > This requires a bit of additional space as you have to repeat header > > > > entries > > > > > > for > > > > > > > the back reference and for the previous resource (so it can compute > > its > > > > > > length). > > > > > > > > > > > > Interesting idea! This would still require a version bump though > > wouldn't > > > > it? > > > > > > The fact that the first ID must be used is a breaking change, as right > > now > > > > the > > > > > > bsearch() doesn't guarantee that. If we're going to bump the version > > code > > > > > > anyways, I think I'd prefer not to add any bytes that are not > necessary > > > > > (unless > > > > > > it would improve performance, but I don't think this does) > > > > > > > > > > It has several perf benefits: > > > > > It eliminates the second binary search when you discover you have a > > > reference. > > > > > Perf and reduced code complexity in not having to loop past reference > ids > > to > > > > > find the next non-reference one. > > > > > > > > > > The cost is only an extra 6 bytes per reference. Technically you could > > avoid > > > > it > > > > > by detecting back references and skipping over them but that adds a lot > of > > > the > > > > > same code complexity and some additional generation complexity because > you > > > > have > > > > > to detect when you can't correctly detect that. > > > > > > > > Oops, sent too early, 12 bytes for random back references, but only 6 > > > additional > > > > bytes if you have repeated back references. > > > > > > > > But yes, I think we would have to bump the revision still since it's not > > > > backwards compatible. > > > > > > Alternately, if we assume the number of references to other id's is low, we > > save > > > more table of contents space by having a second (id -> id) map (only 4 bytes > > per > > > entry) at the cost of an additional TOC lookup. > > > > I think it's safe to assume the number of references is very low. > > > > Note: there is no need for a 2nd binary search. It doesn't encode the aliased > > id, but rather the index into the table of the alias. > > Oh I see, what about storing the location to the table of contents entry? Then > at least we're storing a uniform type of value (no need for an extra bit) and if > it's less than the end of the table of contents you know to treat it > differently? > > > > > One nice thing about encoding the aliased entry is that you could report which > > IDs are aliases without having to create a reverse map of offset->id. Not sure > > it's terribly useful, but could maybe add it to pak_util.py's dump output. > Come > > to think of it, I think I will do this :) > > > > I did consider adding a 2nd table for aliases (option 1 in the linked bug), > but > > figured the size savings wouldn't be worth having to have the pak reading > logic > > have "if (version == 5) blocks in it". It wouldn't be much of a code size > > difference though, so if you feel strongly that it's a better solution I'll > > switch to that. Switched over to using an alias map. ptal :)
The CQ bit was checked by agrieve@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:80: prev_offset = entry_at_index(resource_count)[1] It's a little strange to not read in a forwards direction. I think it might be cleaner to store cur_entry and for each next_entry (i.e. 1 to count) add the cur_entry. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:107: id_by_data = {resources[k]: k for k in reversed(resource_ids)} Comment why we use reversed - presumably want lowest k which resource maps to. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 since version and encoding < 256. Version I could see may someday be > 256, I can just imagine the day when we're sad to have to make a breaking change because we can't create grit files with version 65536. But why grow encoding to a short? https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:119: # Each main table entry is a uint16 + a uint32. Preserve the comment that we have an extra entry for the last item. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack_unittest.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack_unittest.py:45: '\x03\x00\x01\x00' # resource_count, alias_count, nit: I think the test will read more cleanly with one value per line than putting both (version, encoding) and (resource_count, alias_count) into single lines. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:30: static const size_t kHeaderLengthV5 = 4 * sizeof(uint16_t); This seems to be masquerading the actual data it refers to, i.e. the encoding type is still a byte is it not? It's nice to have the comment explaining where the length comes from. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { This means we don't get HEADER_TRUNCATED for an 8 byte v5 header. How about moving the full header check to after we know the version and do an early check that at least the header version is readable? https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:215: int version = data[0]; I think we should keep the version length the same so that we don't have code which assumes the header version is the first byte. Do you think the extra two bytes is a concern? https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:262: // entry after the last item which gives us the length of the last item. Also verify the indices of the alias table are within the resource_count. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:277: const Entry* ret = reinterpret_cast<const Entry*>( nit: Add a comment explaining that we search the resource table first as most resources will be in there, and then the alias table for those which were duplicates. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:377: // TODO(agrieve): Is there any benefit to writing a v5 file (with aliases)? Oh this is interesting. Maybe we should change this to write a v5 file with an empty aliases table for now with a TODO. Maybe that'd mean that someday we won't have to support v4? https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack_literal.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack_literal.cc:26: 0x05, 0x00, 0x01, 0x00, // version, encoding Same comment, use separate lines for multiple values for readability. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack_unittest.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack_unittest.cc:183: ASSERT_TRUE(pack.GetStringPiece(10, &data)); It's probably worth expecting that this points to the same location as id 4.
One comment to resolve still, but heading off on vacation. Sadrul - ptal. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:80: prev_offset = entry_at_index(resource_count)[1] On 2017/07/07 18:54:12, flackr wrote: > It's a little strange to not read in a forwards direction. I think it might be > cleaner to store cur_entry and for each next_entry (i.e. 1 to count) add the > cur_entry. ah, yeah, that's nicer :) https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:107: id_by_data = {resources[k]: k for k in reversed(resource_ids)} On 2017/07/07 18:54:12, flackr wrote: > Comment why we use reversed - presumably want lowest k which resource maps to. Done. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 since version and encoding < 256. On 2017/07/07 18:54:12, flackr wrote: > Version I could see may someday be > 256, I can just imagine the day when we're > sad to have to make a breaking change because we can't create grit files with > version 65536. But why grow encoding to a short? Expanded the comment. I just speculated that things may go faster if the tables are not aligned on odd-numbered bytes. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:119: # Each main table entry is a uint16 + a uint32. On 2017/07/07 18:54:12, flackr wrote: > Preserve the comment that we have an extra entry for the last item. Done. https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack_unittest.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack_unittest.py:45: '\x03\x00\x01\x00' # resource_count, alias_count, On 2017/07/07 18:54:12, flackr wrote: > nit: I think the test will read more cleanly with one value per line than > putting both (version, encoding) and (resource_count, alias_count) into single > lines. Done. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:30: static const size_t kHeaderLengthV5 = 4 * sizeof(uint16_t); On 2017/07/07 18:54:12, flackr wrote: > This seems to be masquerading the actual data it refers to, i.e. the encoding > type is still a byte is it not? It's nice to have the comment explaining where > the length comes from. I think it's actually simpler to just say encoding is 2 bytes. Added comment. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { On 2017/07/07 18:54:12, flackr wrote: > This means we don't get HEADER_TRUNCATED for an 8 byte v5 header. How about > moving the full header check to after we know the version and do an early check > that at least the header version is readable? I think the main point of this check is just to ensure we don't read past eof. I don't think it's worth adding precious code size for an 8 byte v5 header as long as this still results in an error and we don't read past eof (which I believe is the case) https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:215: int version = data[0]; On 2017/07/07 18:54:12, flackr wrote: > I think we should keep the version length the same so that we don't have code > which assumes the header version is the first byte. Do you think the extra two > bytes is a concern? I'm not sure what your concern is here. The code assumes endianness all over. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:262: // entry after the last item which gives us the length of the last item. On 2017/07/07 18:54:12, flackr wrote: > Also verify the indices of the alias table are within the resource_count. Done. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:277: const Entry* ret = reinterpret_cast<const Entry*>( On 2017/07/07 18:54:12, flackr wrote: > nit: Add a comment explaining that we search the resource table first as most > resources will be in there, and then the alias table for those which were > duplicates. Done. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:377: // TODO(agrieve): Is there any benefit to writing a v5 file (with aliases)? On 2017/07/07 18:54:12, flackr wrote: > Oh this is interesting. Maybe we should change this to write a v5 file with an > empty aliases table for now with a TODO. Maybe that'd mean that someday we won't > have to support v4? Sounds good. Just heading out now, but will address later.
https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 since version and encoding < 256. On 2017/07/07 20:47:08, agrieve wrote: > On 2017/07/07 18:54:12, flackr wrote: > > Version I could see may someday be > 256, I can just imagine the day when > we're > > sad to have to make a breaking change because we can't create grit files with > > version 65536. But why grow encoding to a short? > > Expanded the comment. I just speculated that things may go faster if the tables > are not aligned on odd-numbered bytes. I suppose that could help, but it's better IMO to insert an intentional gap so that we know there is a gap there later if/when we need to add something else. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:215: int version = data[0]; On 2017/07/07 20:47:08, agrieve wrote: > On 2017/07/07 18:54:12, flackr wrote: > > I think we should keep the version length the same so that we don't have code > > which assumes the header version is the first byte. Do you think the extra two > > bytes is a concern? > > I'm not sure what your concern is here. The code assumes endianness all over. I'm worried it's bad practice to shorten the version number field on a file format, I don't think saving a couple bytes justifies doing it. For example, this code will now read version 261 as the same as 5, and old code will read version 5 files as the sum of 5 and the encoding * 2^16.
https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... File tools/grit/grit/format/data_pack.py (right): https://codereview.chromium.org/2969123002/diff/100001/tools/grit/grit/format... tools/grit/grit/format/data_pack.py:114: # Note: 2nd and 4th byte are always 0 since version and encoding < 256. On 2017/07/10 14:07:49, flackr wrote: > On 2017/07/07 20:47:08, agrieve wrote: > > On 2017/07/07 18:54:12, flackr wrote: > > > Version I could see may someday be > 256, I can just imagine the day when > > we're > > > sad to have to make a breaking change because we can't create grit files > with > > > version 65536. But why grow encoding to a short? > > > > Expanded the comment. I just speculated that things may go faster if the > tables > > are not aligned on odd-numbered bytes. > > I suppose that could help, but it's better IMO to insert an intentional gap so > that we know there is a gap there later if/when we need to add something else. Done. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:215: int version = data[0]; On 2017/07/10 14:07:49, flackr wrote: > On 2017/07/07 20:47:08, agrieve wrote: > > On 2017/07/07 18:54:12, flackr wrote: > > > I think we should keep the version length the same so that we don't have > code > > > which assumes the header version is the first byte. Do you think the extra > two > > > bytes is a concern? > > > > I'm not sure what your concern is here. The code assumes endianness all over. > > I'm worried it's bad practice to shorten the version number field on a file > format, I don't think saving a couple bytes justifies doing it. For example, > this code will now read version 261 as the same as 5, and old code will read > version 5 files as the sum of 5 and the encoding * 2^16. Yeah, I suppose a few bytes per-file isn't worth changing. https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:377: // TODO(agrieve): Is there any benefit to writing a v5 file (with aliases)? On 2017/07/07 20:47:09, agrieve wrote: > On 2017/07/07 18:54:12, flackr wrote: > > Oh this is interesting. Maybe we should change this to write a v5 file with an > > empty aliases table for now with a TODO. Maybe that'd mean that someday we > won't > > have to support v4? > > Sounds good. Just heading out now, but will address later. Done.
https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { On 2017/07/07 20:47:09, agrieve wrote: > On 2017/07/07 18:54:12, flackr wrote: > > This means we don't get HEADER_TRUNCATED for an 8 byte v5 header. How about > > moving the full header check to after we know the version and do an early > check > > that at least the header version is readable? > > I think the main point of this check is just to ensure we don't read past eof. I > don't think it's worth adding precious code size for an 8 byte v5 header as long > as this still results in an error and we don't read past eof (which I believe is > the case) It does mean that on a truncated v4 file with just the header we'll now get the message that the header has been truncated. The main code size concern is the const string resource right? We can avoid this by having a conditional on the version read. int version = 0; if (data_source->GetLength() >= 4) version = reinterpret_cast<const uint32_t*>(data)[0]; if (version == 0 || (version == 4 && length < kHeaderLengthV4) || (version == 5 && length < kHeaderLengthV5)) Display header truncated error. https://codereview.chromium.org/2969123002/diff/140001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/140001/ui/base/resource/data_... ui/base/resource/data_pack.cc:281: << "Alias #" << i << " past end of file."; nit: s/end of file/end of resource table
https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/100001/ui/base/resource/data_... ui/base/resource/data_pack.cc:207: if (data_source->GetLength() < kHeaderLengthV4) { On 2017/07/18 20:21:23, flackr wrote: > On 2017/07/07 20:47:09, agrieve wrote: > > On 2017/07/07 18:54:12, flackr wrote: > > > This means we don't get HEADER_TRUNCATED for an 8 byte v5 header. How about > > > moving the full header check to after we know the version and do an early > > check > > > that at least the header version is readable? > > > > I think the main point of this check is just to ensure we don't read past eof. > I > > don't think it's worth adding precious code size for an 8 byte v5 header as > long > > as this still results in an error and we don't read past eof (which I believe > is > > the case) > > It does mean that on a truncated v4 file with just the header we'll now get the > message that the header has been truncated. > > The main code size concern is the const string resource right? We can avoid this > by having a conditional on the version read. > > int version = 0; > if (data_source->GetLength() >= 4) > version = reinterpret_cast<const uint32_t*>(data)[0]; > if (version == 0 || (version == 4 && length < kHeaderLengthV4) || (version == 5 > && length < kHeaderLengthV5)) > Display header truncated error. Done. https://codereview.chromium.org/2969123002/diff/140001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/140001/ui/base/resource/data_... ui/base/resource/data_pack.cc:281: << "Alias #" << i << " past end of file."; On 2017/07/18 20:21:23, flackr wrote: > nit: s/end of file/end of resource table changed them both to just "past end"
LGTM with nit, thanks for bearing with me on the concerns. https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_... ui/base/resource/data_pack.cc:213: if (data_length > 4) nit: >= sizeof(uint32_t)
And thanks for putting thought into the reviews! Sadrul - ptal. https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_... File ui/base/resource/data_pack.cc (right): https://codereview.chromium.org/2969123002/diff/160001/ui/base/resource/data_... ui/base/resource/data_pack.cc:213: if (data_length > 4) On 2017/07/19 03:05:04, flackr wrote: > nit: >= sizeof(uint32_t) Done.
The CQ bit was checked by agrieve@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...)
The CQ bit was checked by agrieve@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
lgtm
The CQ bit was checked by agrieve@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from flackr@chromium.org Link to the patchset: https://codereview.chromium.org/2969123002/#ps180001 (title: "sizeof()")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 180001, "attempt_start_ts": 1500554540567170, "parent_rev": "a83609260044160cb960f95b6f12346161e2f248", "commit_rev": "ade347f539a378fb37500f6d17e8835edc1d8ec0"}
Message was sent while issue was closed.
Description was changed from ========== Add deduplication logic to .pak files Now, when multiple entries contain the same content, multiple table-of-contents entries will be created, but only one data region. As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates. For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and compressed .pak size by 32kb. BUG=738566 ========== to ========== Add deduplication logic to .pak files Now, when multiple entries contain the same content, multiple table-of-contents entries will be created, but only one data region. As of now, en-US.pak has ~3200 entries, and 350 of them are duplicates. For MonochromePublic.apk, shrinks uncompressed .pak size by ~130kb, and compressed .pak size by 32kb. BUG=738566 Review-Url: https://codereview.chromium.org/2969123002 Cr-Commit-Position: refs/heads/master@{#488215} Committed: https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8... ==========
Message was sent while issue was closed.
Committed patchset #10 (id:180001) as https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8...
Message was sent while issue was closed.
On 2017/07/20 12:48:33, commit-bot: I haz the power wrote: > Committed patchset #10 (id:180001) as > https://chromium.googlesource.com/chromium/src/+/ade347f539a378fb37500f6d17e8... This is causing a build failure on chromeos: https://bugs.chromium.org/p/chromium/issues/detail?id=747171
Message was sent while issue was closed.
A revert of this CL (patchset #10 id:180001) has been created in https://codereview.chromium.org/2989443002/ by achuith@chromium.org. The reason for reverting is: https://bugs.chromium.org/p/chromium/issues/detail?id=747171. |