OLD | NEW |
(Empty) | |
| 1 |
| 2 The .xz File Format |
| 3 =================== |
| 4 |
| 5 Version 1.0.4 (2009-08-27) |
| 6 |
| 7 |
| 8 0. Preface |
| 9 0.1. Notices and Acknowledgements |
| 10 0.2. Getting the Latest Version |
| 11 0.3. Version History |
| 12 1. Conventions |
| 13 1.1. Byte and Its Representation |
| 14 1.2. Multibyte Integers |
| 15 2. Overall Structure of .xz File |
| 16 2.1. Stream |
| 17 2.1.1. Stream Header |
| 18 2.1.1.1. Header Magic Bytes |
| 19 2.1.1.2. Stream Flags |
| 20 2.1.1.3. CRC32 |
| 21 2.1.2. Stream Footer |
| 22 2.1.2.1. CRC32 |
| 23 2.1.2.2. Backward Size |
| 24 2.1.2.3. Stream Flags |
| 25 2.1.2.4. Footer Magic Bytes |
| 26 2.2. Stream Padding |
| 27 3. Block |
| 28 3.1. Block Header |
| 29 3.1.1. Block Header Size |
| 30 3.1.2. Block Flags |
| 31 3.1.3. Compressed Size |
| 32 3.1.4. Uncompressed Size |
| 33 3.1.5. List of Filter Flags |
| 34 3.1.6. Header Padding |
| 35 3.1.7. CRC32 |
| 36 3.2. Compressed Data |
| 37 3.3. Block Padding |
| 38 3.4. Check |
| 39 4. Index |
| 40 4.1. Index Indicator |
| 41 4.2. Number of Records |
| 42 4.3. List of Records |
| 43 4.3.1. Unpadded Size |
| 44 4.3.2. Uncompressed Size |
| 45 4.4. Index Padding |
| 46 4.5. CRC32 |
| 47 5. Filter Chains |
| 48 5.1. Alignment |
| 49 5.2. Security |
| 50 5.3. Filters |
| 51 5.3.1. LZMA2 |
| 52 5.3.2. Branch/Call/Jump Filters for Executables |
| 53 5.3.3. Delta |
| 54 5.3.3.1. Format of the Encoded Output |
| 55 5.4. Custom Filter IDs |
| 56 5.4.1. Reserved Custom Filter ID Ranges |
| 57 6. Cyclic Redundancy Checks |
| 58 7. References |
| 59 |
| 60 |
| 61 0. Preface |
| 62 |
| 63 This document describes the .xz file format (filename suffix |
| 64 ".xz", MIME type "application/x-xz"). It is intended that this |
| 65 this format replace the old .lzma format used by LZMA SDK and |
| 66 LZMA Utils. |
| 67 |
| 68 |
| 69 0.1. Notices and Acknowledgements |
| 70 |
| 71 This file format was designed by Lasse Collin |
| 72 <lasse.collin@tukaani.org> and Igor Pavlov. |
| 73 |
| 74 Special thanks for helping with this document goes to |
| 75 Ville Koskinen. Thanks for helping with this document goes to |
| 76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius. |
| 77 |
| 78 This document has been put into the public domain. |
| 79 |
| 80 |
| 81 0.2. Getting the Latest Version |
| 82 |
| 83 The latest official version of this document can be downloaded |
| 84 from <http://tukaani.org/xz/xz-file-format.txt>. |
| 85 |
| 86 Specific versions of this document have a filename |
| 87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number. |
| 88 For example, the version 1.0.0 of this document is available |
| 89 at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>. |
| 90 |
| 91 |
| 92 0.3. Version History |
| 93 |
| 94 Version Date Description |
| 95 |
| 96 1.0.4 2009-08-27 Language improvements in Sections 1.2, |
| 97 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1 |
| 98 |
| 99 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4 |
| 100 |
| 101 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1 |
| 102 |
| 103 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor |
| 104 clarifications to Sections 2, 2.2, |
| 105 3.3, 4.4, and 5.3.2 |
| 106 |
| 107 1.0.0 2009-01-14 The first official version |
| 108 |
| 109 |
| 110 1. Conventions |
| 111 |
| 112 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", |
| 113 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this |
| 114 document are to be interpreted as described in [RFC-2119]. |
| 115 |
| 116 Indicating a warning means displaying a message, returning |
| 117 appropriate exit status, or doing something else to let the |
| 118 user know that something worth warning occurred. The operation |
| 119 SHOULD still finish if a warning is indicated. |
| 120 |
| 121 Indicating an error means displaying a message, returning |
| 122 appropriate exit status, or doing something else to let the |
| 123 user know that something prevented successfully finishing the |
| 124 operation. The operation MUST be aborted once an error has |
| 125 been indicated. |
| 126 |
| 127 |
| 128 1.1. Byte and Its Representation |
| 129 |
| 130 In this document, byte is always 8 bits. |
| 131 |
| 132 A "null byte" has all bits unset. That is, the value of a null |
| 133 byte is 0x00. |
| 134 |
| 135 To represent byte blocks, this document uses notation that |
| 136 is similar to the notation used in [RFC-1952]: |
| 137 |
| 138 +-------+ |
| 139 | Foo | One byte. |
| 140 +-------+ |
| 141 |
| 142 +---+---+ |
| 143 | Foo | Two bytes; that is, some of the vertical bars |
| 144 +---+---+ can be missing. |
| 145 |
| 146 +=======+ |
| 147 | Foo | Zero or more bytes. |
| 148 +=======+ |
| 149 |
| 150 In this document, a boxed byte or a byte sequence declared |
| 151 using this notation is called "a field". The example field |
| 152 above would be called "the Foo field" or plain "Foo". |
| 153 |
| 154 If there are many fields, they may be split to multiple lines. |
| 155 This is indicated with an arrow ("--->"): |
| 156 |
| 157 +=====+ |
| 158 | Foo | |
| 159 +=====+ |
| 160 |
| 161 +=====+ |
| 162 ---> | Bar | |
| 163 +=====+ |
| 164 |
| 165 The above is equivalent to this: |
| 166 |
| 167 +=====+=====+ |
| 168 | Foo | Bar | |
| 169 +=====+=====+ |
| 170 |
| 171 |
| 172 1.2. Multibyte Integers |
| 173 |
| 174 Multibyte integers of static length, such as CRC values, |
| 175 are stored in little endian byte order (least significant |
| 176 byte first). |
| 177 |
| 178 When smaller values are more likely than bigger values (for |
| 179 example file sizes), multibyte integers are encoded in a |
| 180 variable-length representation: |
| 181 - Numbers in the range [0, 127] are copied as is, and take |
| 182 one byte of space. |
| 183 - Bigger numbers will occupy two or more bytes. All but the |
| 184 last byte of the multibyte representation have the highest |
| 185 (eighth) bit set. |
| 186 |
| 187 For now, the value of the variable-length integers is limited |
| 188 to 63 bits, which limits the encoded size of the integer to |
| 189 nine bytes. These limits may be increased in the future if |
| 190 needed. |
| 191 |
| 192 The following C code illustrates encoding and decoding of |
| 193 variable-length integers. The functions return the number of |
| 194 bytes occupied by the integer (1-9), or zero on error. |
| 195 |
| 196 #include <stddef.h> |
| 197 #include <inttypes.h> |
| 198 |
| 199 size_t |
| 200 encode(uint8_t buf[static 9], uint64_t num) |
| 201 { |
| 202 if (num > UINT64_MAX / 2) |
| 203 return 0; |
| 204 |
| 205 size_t i = 0; |
| 206 |
| 207 while (num >= 0x80) { |
| 208 buf[i++] = (uint8_t)(num) | 0x80; |
| 209 num >>= 7; |
| 210 } |
| 211 |
| 212 buf[i++] = (uint8_t)(num); |
| 213 |
| 214 return i; |
| 215 } |
| 216 |
| 217 size_t |
| 218 decode(const uint8_t buf[], size_t size_max, uint64_t *num) |
| 219 { |
| 220 if (size_max == 0) |
| 221 return 0; |
| 222 |
| 223 if (size_max > 9) |
| 224 size_max = 9; |
| 225 |
| 226 *num = buf[0] & 0x7F; |
| 227 size_t i = 0; |
| 228 |
| 229 while (buf[i++] & 0x80) { |
| 230 if (i >= size_max || buf[i] == 0x00) |
| 231 return 0; |
| 232 |
| 233 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7); |
| 234 } |
| 235 |
| 236 return i; |
| 237 } |
| 238 |
| 239 |
| 240 2. Overall Structure of .xz File |
| 241 |
| 242 A standalone .xz files consist of one or more Streams which may |
| 243 have Stream Padding between or after them: |
| 244 |
| 245 +========+================+========+================+ |
| 246 | Stream | Stream Padding | Stream | Stream Padding | ... |
| 247 +========+================+========+================+ |
| 248 |
| 249 The sizes of Stream and Stream Padding are always multiples |
| 250 of four bytes, thus the size of every valid .xz file MUST be |
| 251 a multiple of four bytes. |
| 252 |
| 253 While a typical file contains only one Stream and no Stream |
| 254 Padding, a decoder handling standalone .xz files SHOULD support |
| 255 files that have more than one Stream or Stream Padding. |
| 256 |
| 257 In contrast to standalone .xz files, when the .xz file format |
| 258 is used as an internal part of some other file format or |
| 259 communication protocol, it usually is expected that the decoder |
| 260 stops after the first Stream, and doesn't look for Stream |
| 261 Padding or possibly other Streams. |
| 262 |
| 263 |
| 264 2.1. Stream |
| 265 |
| 266 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ |
| 267 | Stream Header | Block | Block | ... | Block | |
| 268 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ |
| 269 |
| 270 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 271 ---> | Index | Stream Footer | |
| 272 +=======+-+-+-+-+-+-+-+-+-+-+-+-+ |
| 273 |
| 274 All the above fields have a size that is a multiple of four. If |
| 275 Stream is used as an internal part of another file format, it |
| 276 is RECOMMENDED to make the Stream start at an offset that is |
| 277 a multiple of four bytes. |
| 278 |
| 279 Stream Header, Index, and Stream Footer are always present in |
| 280 a Stream. The maximum size of the Index field is 16 GiB (2^34). |
| 281 |
| 282 There are zero or more Blocks. The maximum number of Blocks is |
| 283 limited only by the maximum size of the Index field. |
| 284 |
| 285 Total size of a Stream MUST be less than 8 EiB (2^63 bytes). |
| 286 The same limit applies to the total amount of uncompressed |
| 287 data stored in a Stream. |
| 288 |
| 289 If an implementation supports handling .xz files with multiple |
| 290 concatenated Streams, it MAY apply the above limits to the file |
| 291 as a whole instead of limiting per Stream basis. |
| 292 |
| 293 |
| 294 2.1.1. Stream Header |
| 295 |
| 296 +---+---+---+---+---+---+-------+------+--+--+--+--+ |
| 297 | Header Magic Bytes | Stream Flags | CRC32 | |
| 298 +---+---+---+---+---+---+-------+------+--+--+--+--+ |
| 299 |
| 300 |
| 301 2.1.1.1. Header Magic Bytes |
| 302 |
| 303 The first six (6) bytes of the Stream are so called Header |
| 304 Magic Bytes. They can be used to identify the file type. |
| 305 |
| 306 Using a C array and ASCII: |
| 307 const uint8_t HEADER_MAGIC[6] |
| 308 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 }; |
| 309 |
| 310 In plain hexadecimal: |
| 311 FD 37 7A 58 5A 00 |
| 312 |
| 313 Notes: |
| 314 - The first byte (0xFD) was chosen so that the files cannot |
| 315 be erroneously detected as being in .lzma format, in which |
| 316 the first byte is in the range [0x00, 0xE0]. |
| 317 - The sixth byte (0x00) was chosen to prevent applications |
| 318 from misdetecting the file as a text file. |
| 319 |
| 320 If the Header Magic Bytes don't match, the decoder MUST |
| 321 indicate an error. |
| 322 |
| 323 |
| 324 2.1.1.2. Stream Flags |
| 325 |
| 326 The first byte of Stream Flags is always a null byte. In the |
| 327 future, this byte may be used to indicate a new Stream version |
| 328 or other Stream properties. |
| 329 |
| 330 The second byte of Stream Flags is a bit field: |
| 331 |
| 332 Bit(s) Mask Description |
| 333 0-3 0x0F Type of Check (see Section 3.4): |
| 334 ID Size Check name |
| 335 0x00 0 bytes None |
| 336 0x01 4 bytes CRC32 |
| 337 0x02 4 bytes (Reserved) |
| 338 0x03 4 bytes (Reserved) |
| 339 0x04 8 bytes CRC64 |
| 340 0x05 8 bytes (Reserved) |
| 341 0x06 8 bytes (Reserved) |
| 342 0x07 16 bytes (Reserved) |
| 343 0x08 16 bytes (Reserved) |
| 344 0x09 16 bytes (Reserved) |
| 345 0x0A 32 bytes SHA-256 |
| 346 0x0B 32 bytes (Reserved) |
| 347 0x0C 32 bytes (Reserved) |
| 348 0x0D 64 bytes (Reserved) |
| 349 0x0E 64 bytes (Reserved) |
| 350 0x0F 64 bytes (Reserved) |
| 351 4-7 0xF0 Reserved for future use; MUST be zero for now. |
| 352 |
| 353 Implementations SHOULD support at least the Check IDs 0x00 |
| 354 (None) and 0x01 (CRC32). Supporting other Check IDs is |
| 355 OPTIONAL. If an unsupported Check is used, the decoder SHOULD |
| 356 indicate a warning or error. |
| 357 |
| 358 If any reserved bit is set, the decoder MUST indicate an error. |
| 359 It is possible that there is a new field present which the |
| 360 decoder is not aware of, and can thus parse the Stream Header |
| 361 incorrectly. |
| 362 |
| 363 |
| 364 2.1.1.3. CRC32 |
| 365 |
| 366 The CRC32 is calculated from the Stream Flags field. It is |
| 367 stored as an unsigned 32-bit little endian integer. If the |
| 368 calculated value does not match the stored one, the decoder |
| 369 MUST indicate an error. |
| 370 |
| 371 The idea is that Stream Flags would always be two bytes, even |
| 372 if new features are needed. This way old decoders will be able |
| 373 to verify the CRC32 calculated from Stream Flags, and thus |
| 374 distinguish between corrupt files (CRC32 doesn't match) and |
| 375 files that the decoder doesn't support (CRC32 matches but |
| 376 Stream Flags has reserved bits set). |
| 377 |
| 378 |
| 379 2.1.2. Stream Footer |
| 380 |
| 381 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ |
| 382 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes | |
| 383 +-+-+-+-+---+---+---+---+-------+------+----------+---------+ |
| 384 |
| 385 |
| 386 2.1.2.1. CRC32 |
| 387 |
| 388 The CRC32 is calculated from the Backward Size and Stream Flags |
| 389 fields. It is stored as an unsigned 32-bit little endian |
| 390 integer. If the calculated value does not match the stored one, |
| 391 the decoder MUST indicate an error. |
| 392 |
| 393 The reason to have the CRC32 field before the Backward Size and |
| 394 Stream Flags fields is to keep the four-byte fields aligned to |
| 395 a multiple of four bytes. |
| 396 |
| 397 |
| 398 2.1.2.2. Backward Size |
| 399 |
| 400 Backward Size is stored as a 32-bit little endian integer, |
| 401 which indicates the size of the Index field as multiple of |
| 402 four bytes, minimum value being four bytes: |
| 403 |
| 404 real_backward_size = (stored_backward_size + 1) * 4; |
| 405 |
| 406 If the stored value does not match the real size of the Index |
| 407 field, the decoder MUST indicate an error. |
| 408 |
| 409 Using a fixed-size integer to store Backward Size makes |
| 410 it slightly simpler to parse the Stream Footer when the |
| 411 application needs to parse the Stream backwards. |
| 412 |
| 413 |
| 414 2.1.2.3. Stream Flags |
| 415 |
| 416 This is a copy of the Stream Flags field from the Stream |
| 417 Header. The information stored to Stream Flags is needed |
| 418 when parsing the Stream backwards. The decoder MUST compare |
| 419 the Stream Flags fields in both Stream Header and Stream |
| 420 Footer, and indicate an error if they are not identical. |
| 421 |
| 422 |
| 423 2.1.2.4. Footer Magic Bytes |
| 424 |
| 425 As the last step of the decoding process, the decoder MUST |
| 426 verify the existence of Footer Magic Bytes. If they don't |
| 427 match, an error MUST be indicated. |
| 428 |
| 429 Using a C array and ASCII: |
| 430 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; |
| 431 |
| 432 In hexadecimal: |
| 433 59 5A |
| 434 |
| 435 The primary reason to have Footer Magic Bytes is to make |
| 436 it easier to detect incomplete files quickly, without |
| 437 uncompressing. If the file does not end with Footer Magic Bytes |
| 438 (excluding Stream Padding described in Section 2.2), it cannot |
| 439 be undamaged, unless someone has intentionally appended garbage |
| 440 after the end of the Stream. |
| 441 |
| 442 |
| 443 2.2. Stream Padding |
| 444 |
| 445 Only the decoders that support decoding of concatenated Streams |
| 446 MUST support Stream Padding. |
| 447 |
| 448 Stream Padding MUST contain only null bytes. To preserve the |
| 449 four-byte alignment of consecutive Streams, the size of Stream |
| 450 Padding MUST be a multiple of four bytes. Empty Stream Padding |
| 451 is allowed. If these requirements are not met, the decoder MUST |
| 452 indicate an error. |
| 453 |
| 454 Note that non-empty Stream Padding is allowed at the end of the |
| 455 file; there doesn't need to be a new Stream after non-empty |
| 456 Stream Padding. This can be convenient in certain situations |
| 457 [GNU-tar]. |
| 458 |
| 459 The possibility of Stream Padding MUST be taken into account |
| 460 when designing an application that parses Streams backwards, |
| 461 and the application supports concatenated Streams. |
| 462 |
| 463 |
| 464 3. Block |
| 465 |
| 466 +==============+=================+===============+=======+ |
| 467 | Block Header | Compressed Data | Block Padding | Check | |
| 468 +==============+=================+===============+=======+ |
| 469 |
| 470 |
| 471 3.1. Block Header |
| 472 |
| 473 +-------------------+-------------+=================+ |
| 474 | Block Header Size | Block Flags | Compressed Size | |
| 475 +-------------------+-------------+=================+ |
| 476 |
| 477 +===================+======================+ |
| 478 ---> | Uncompressed Size | List of Filter Flags | |
| 479 +===================+======================+ |
| 480 |
| 481 +================+--+--+--+--+ |
| 482 ---> | Header Padding | CRC32 | |
| 483 +================+--+--+--+--+ |
| 484 |
| 485 |
| 486 3.1.1. Block Header Size |
| 487 |
| 488 This field overlaps with the Index Indicator field (see |
| 489 Section 4.1). |
| 490 |
| 491 This field contains the size of the Block Header field, |
| 492 including the Block Header Size field itself. Valid values are |
| 493 in the range [0x01, 0xFF], which indicate the size of the Block |
| 494 Header as multiples of four bytes, minimum size being eight |
| 495 bytes: |
| 496 |
| 497 real_header_size = (encoded_header_size + 1) * 4; |
| 498 |
| 499 If a Block Header bigger than 1024 bytes is needed in the |
| 500 future, a new field can be added between the Block Header and |
| 501 Compressed Data fields. The presence of this new field would |
| 502 be indicated in the Block Header field. |
| 503 |
| 504 |
| 505 3.1.2. Block Flags |
| 506 |
| 507 The Block Flags field is a bit field: |
| 508 |
| 509 Bit(s) Mask Description |
| 510 0-1 0x03 Number of filters (1-4) |
| 511 2-5 0x3C Reserved for future use; MUST be zero for now. |
| 512 6 0x40 The Compressed Size field is present. |
| 513 7 0x80 The Uncompressed Size field is present. |
| 514 |
| 515 If any reserved bit is set, the decoder MUST indicate an error. |
| 516 It is possible that there is a new field present which the |
| 517 decoder is not aware of, and can thus parse the Block Header |
| 518 incorrectly. |
| 519 |
| 520 |
| 521 3.1.3. Compressed Size |
| 522 |
| 523 This field is present only if the appropriate bit is set in |
| 524 the Block Flags field (see Section 3.1.2). |
| 525 |
| 526 The Compressed Size field contains the size of the Compressed |
| 527 Data field, which MUST be non-zero. Compressed Size is stored |
| 528 using the encoding described in Section 1.2. If the Compressed |
| 529 Size doesn't match the size of the Compressed Data field, the |
| 530 decoder MUST indicate an error. |
| 531 |
| 532 |
| 533 3.1.4. Uncompressed Size |
| 534 |
| 535 This field is present only if the appropriate bit is set in |
| 536 the Block Flags field (see Section 3.1.2). |
| 537 |
| 538 The Uncompressed Size field contains the size of the Block |
| 539 after uncompressing. Uncompressed Size is stored using the |
| 540 encoding described in Section 1.2. If the Uncompressed Size |
| 541 does not match the real uncompressed size, the decoder MUST |
| 542 indicate an error. |
| 543 |
| 544 Storing the Compressed Size and Uncompressed Size fields serves |
| 545 several purposes: |
| 546 - The decoder knows how much memory it needs to allocate |
| 547 for a temporary buffer in multithreaded mode. |
| 548 - Simple error detection: wrong size indicates a broken file. |
| 549 - Seeking forwards to a specific location in streamed mode. |
| 550 |
| 551 It should be noted that the only reliable way to determine |
| 552 the real uncompressed size is to uncompress the Block, |
| 553 because the Block Header and Index fields may contain |
| 554 (intentionally or unintentionally) invalid information. |
| 555 |
| 556 |
| 557 3.1.5. List of Filter Flags |
| 558 |
| 559 +================+================+ +================+ |
| 560 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | |
| 561 +================+================+ +================+ |
| 562 |
| 563 The number of Filter Flags fields is stored in the Block Flags |
| 564 field (see Section 3.1.2). |
| 565 |
| 566 The format of each Filter Flags field is as follows: |
| 567 |
| 568 +===========+====================+===================+ |
| 569 | Filter ID | Size of Properties | Filter Properties | |
| 570 +===========+====================+===================+ |
| 571 |
| 572 Both Filter ID and Size of Properties are stored using the |
| 573 encoding described in Section 1.2. Size of Properties indicates |
| 574 the size of the Filter Properties field as bytes. The list of |
| 575 officially defined Filter IDs and the formats of their Filter |
| 576 Properties are described in Section 5.3. |
| 577 |
| 578 Filter IDs greater than or equal to 0x4000_0000_0000_0000 |
| 579 (2^62) are reserved for implementation-specific internal use. |
| 580 These Filter IDs MUST never be used in List of Filter Flags. |
| 581 |
| 582 |
| 583 3.1.6. Header Padding |
| 584 |
| 585 This field contains as many null byte as it is needed to make |
| 586 the Block Header have the size specified in Block Header Size. |
| 587 If any of the bytes are not null bytes, the decoder MUST |
| 588 indicate an error. It is possible that there is a new field |
| 589 present which the decoder is not aware of, and can thus parse |
| 590 the Block Header incorrectly. |
| 591 |
| 592 |
| 593 3.1.7. CRC32 |
| 594 |
| 595 The CRC32 is calculated over everything in the Block Header |
| 596 field except the CRC32 field itself. It is stored as an |
| 597 unsigned 32-bit little endian integer. If the calculated |
| 598 value does not match the stored one, the decoder MUST indicate |
| 599 an error. |
| 600 |
| 601 By verifying the CRC32 of the Block Header before parsing the |
| 602 actual contents allows the decoder to distinguish between |
| 603 corrupt and unsupported files. |
| 604 |
| 605 |
| 606 3.2. Compressed Data |
| 607 |
| 608 The format of Compressed Data depends on Block Flags and List |
| 609 of Filter Flags. Excluding the descriptions of the simplest |
| 610 filters in Section 5.3, the format of the filter-specific |
| 611 encoded data is out of scope of this document. |
| 612 |
| 613 |
| 614 3.3. Block Padding |
| 615 |
| 616 Block Padding MUST contain 0-3 null bytes to make the size of |
| 617 the Block a multiple of four bytes. This can be needed when |
| 618 the size of Compressed Data is not a multiple of four. If any |
| 619 of the bytes in Block Padding are not null bytes, the decoder |
| 620 MUST indicate an error. |
| 621 |
| 622 |
| 623 3.4. Check |
| 624 |
| 625 The type and size of the Check field depends on which bits |
| 626 are set in the Stream Flags field (see Section 2.1.1.2). |
| 627 |
| 628 The Check, when used, is calculated from the original |
| 629 uncompressed data. If the calculated Check does not match the |
| 630 stored one, the decoder MUST indicate an error. If the selected |
| 631 type of Check is not supported by the decoder, it SHOULD |
| 632 indicate a warning or error. |
| 633 |
| 634 |
| 635 4. Index |
| 636 |
| 637 +-----------------+===================+ |
| 638 | Index Indicator | Number of Records | |
| 639 +-----------------+===================+ |
| 640 |
| 641 +=================+===============+-+-+-+-+ |
| 642 ---> | List of Records | Index Padding | CRC32 | |
| 643 +=================+===============+-+-+-+-+ |
| 644 |
| 645 Index serves several purposes. Using it, one can |
| 646 - verify that all Blocks in a Stream have been processed; |
| 647 - find out the uncompressed size of a Stream; and |
| 648 - quickly access the beginning of any Block (random access). |
| 649 |
| 650 |
| 651 4.1. Index Indicator |
| 652 |
| 653 This field overlaps with the Block Header Size field (see |
| 654 Section 3.1.1). The value of Index Indicator is always 0x00. |
| 655 |
| 656 |
| 657 4.2. Number of Records |
| 658 |
| 659 This field indicates how many Records there are in the List |
| 660 of Records field, and thus how many Blocks there are in the |
| 661 Stream. The value is stored using the encoding described in |
| 662 Section 1.2. If the decoder has decoded all the Blocks of the |
| 663 Stream, and then notices that the Number of Records doesn't |
| 664 match the real number of Blocks, the decoder MUST indicate an |
| 665 error. |
| 666 |
| 667 |
| 668 4.3. List of Records |
| 669 |
| 670 List of Records consists of as many Records as indicated by the |
| 671 Number of Records field: |
| 672 |
| 673 +========+========+ |
| 674 | Record | Record | ... |
| 675 +========+========+ |
| 676 |
| 677 Each Record contains information about one Block: |
| 678 |
| 679 +===============+===================+ |
| 680 | Unpadded Size | Uncompressed Size | |
| 681 +===============+===================+ |
| 682 |
| 683 If the decoder has decoded all the Blocks of the Stream, it |
| 684 MUST verify that the contents of the Records match the real |
| 685 Unpadded Size and Uncompressed Size of the respective Blocks. |
| 686 |
| 687 Implementation hint: It is possible to verify the Index with |
| 688 constant memory usage by calculating for example SHA-256 of |
| 689 both the real size values and the List of Records, then |
| 690 comparing the hash values. Implementing this using |
| 691 non-cryptographic hash like CRC32 SHOULD be avoided unless |
| 692 small code size is important. |
| 693 |
| 694 If the decoder supports random-access reading, it MUST verify |
| 695 that Unpadded Size and Uncompressed Size of every completely |
| 696 decoded Block match the sizes stored in the Index. If only |
| 697 partial Block is decoded, the decoder MUST verify that the |
| 698 processed sizes don't exceed the sizes stored in the Index. |
| 699 |
| 700 |
| 701 4.3.1. Unpadded Size |
| 702 |
| 703 This field indicates the size of the Block excluding the Block |
| 704 Padding field. That is, Unpadded Size is the size of the Block |
| 705 Header, Compressed Data, and Check fields. Unpadded Size is |
| 706 stored using the encoding described in Section 1.2. The value |
| 707 MUST never be zero; with the current structure of Blocks, the |
| 708 actual minimum value for Unpadded Size is five. |
| 709 |
| 710 Implementation note: Because the size of the Block Padding |
| 711 field is not included in Unpadded Size, calculating the total |
| 712 size of a Stream or doing random-access reading requires |
| 713 calculating the actual size of the Blocks by rounding Unpadded |
| 714 Sizes up to the next multiple of four. |
| 715 |
| 716 The reason to exclude Block Padding from Unpadded Size is to |
| 717 ease making a raw copy of Compressed Data without Block |
| 718 Padding. This can be useful, for example, if someone wants |
| 719 to convert Streams to some other file format quickly. |
| 720 |
| 721 |
| 722 4.3.2. Uncompressed Size |
| 723 |
| 724 This field indicates the Uncompressed Size of the respective |
| 725 Block as bytes. The value is stored using the encoding |
| 726 described in Section 1.2. |
| 727 |
| 728 |
| 729 4.4. Index Padding |
| 730 |
| 731 This field MUST contain 0-3 null bytes to pad the Index to |
| 732 a multiple of four bytes. If any of the bytes are not null |
| 733 bytes, the decoder MUST indicate an error. |
| 734 |
| 735 |
| 736 4.5. CRC32 |
| 737 |
| 738 The CRC32 is calculated over everything in the Index field |
| 739 except the CRC32 field itself. The CRC32 is stored as an |
| 740 unsigned 32-bit little endian integer. If the calculated |
| 741 value does not match the stored one, the decoder MUST indicate |
| 742 an error. |
| 743 |
| 744 |
| 745 5. Filter Chains |
| 746 |
| 747 The Block Flags field defines how many filters are used. When |
| 748 more than one filter is used, the filters are chained; that is, |
| 749 the output of one filter is the input of another filter. The |
| 750 following figure illustrates the direction of data flow. |
| 751 |
| 752 v Uncompressed Data ^ |
| 753 | Filter 0 | |
| 754 Encoder | Filter 1 | Decoder |
| 755 | Filter n | |
| 756 v Compressed Data ^ |
| 757 |
| 758 |
| 759 5.1. Alignment |
| 760 |
| 761 Alignment of uncompressed input data is usually the job of |
| 762 the application producing the data. For example, to get the |
| 763 best results, an archiver tool should make sure that all |
| 764 PowerPC executable files in the archive stream start at |
| 765 offsets that are multiples of four bytes. |
| 766 |
| 767 Some filters, for example LZMA2, can be configured to take |
| 768 advantage of specified alignment of input data. Note that |
| 769 taking advantage of aligned input can be beneficial also when |
| 770 a filter is not the first filter in the chain. For example, |
| 771 if you compress PowerPC executables, you may want to use the |
| 772 PowerPC filter and chain that with the LZMA2 filter. Because |
| 773 not only the input but also the output alignment of the PowerPC |
| 774 filter is four bytes, it is now beneficial to set LZMA2 |
| 775 settings so that the LZMA2 encoder can take advantage of its |
| 776 four-byte-aligned input data. |
| 777 |
| 778 The output of the last filter in the chain is stored to the |
| 779 Compressed Data field, which is is guaranteed to be aligned |
| 780 to a multiple of four bytes relative to the beginning of the |
| 781 Stream. This can increase |
| 782 - speed, if the filtered data is handled multiple bytes at |
| 783 a time by the filter-specific encoder and decoder, |
| 784 because accessing aligned data in computer memory is |
| 785 usually faster; and |
| 786 - compression ratio, if the output data is later compressed |
| 787 with an external compression tool. |
| 788 |
| 789 |
| 790 5.2. Security |
| 791 |
| 792 If filters would be allowed to be chained freely, it would be |
| 793 possible to create malicious files, that would be very slow to |
| 794 decode. Such files could be used to create denial of service |
| 795 attacks. |
| 796 |
| 797 Slow files could occur when multiple filters are chained: |
| 798 |
| 799 v Compressed input data |
| 800 | Filter 1 decoder (last filter) |
| 801 | Filter 0 decoder (non-last filter) |
| 802 v Uncompressed output data |
| 803 |
| 804 The decoder of the last filter in the chain produces a lot of |
| 805 output from little input. Another filter in the chain takes the |
| 806 output of the last filter, and produces very little output |
| 807 while consuming a lot of input. As a result, a lot of data is |
| 808 moved inside the filter chain, but the filter chain as a whole |
| 809 gets very little work done. |
| 810 |
| 811 To prevent this kind of slow files, there are restrictions on |
| 812 how the filters can be chained. These restrictions MUST be |
| 813 taken into account when designing new filters. |
| 814 |
| 815 The maximum number of filters in the chain has been limited to |
| 816 four, thus there can be at maximum of three non-last filters. |
| 817 Of these three non-last filters, only two are allowed to change |
| 818 the size of the data. |
| 819 |
| 820 The non-last filters, that change the size of the data, MUST |
| 821 have a limit how much the decoder can compress the data: the |
| 822 decoder SHOULD produce at least n bytes of output when the |
| 823 filter is given 2n bytes of input. This limit is not |
| 824 absolute, but significant deviations MUST be avoided. |
| 825 |
| 826 The above limitations guarantee that if the last filter in the |
| 827 chain produces 4n bytes of output, the chain as a whole will |
| 828 produce at least n bytes of output. |
| 829 |
| 830 |
| 831 5.3. Filters |
| 832 |
| 833 5.3.1. LZMA2 |
| 834 |
| 835 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose |
| 836 compression algorithm with high compression ratio and fast |
| 837 decompression. LZMA is based on LZ77 and range coding |
| 838 algorithms. |
| 839 |
| 840 LZMA2 is an extension on top of the original LZMA. LZMA2 uses |
| 841 LZMA internally, but adds support for flushing the encoder, |
| 842 uncompressed chunks, eases stateful decoder implementations, |
| 843 and improves support for multithreading. Thus, the plain LZMA |
| 844 will not be supported in this file format. |
| 845 |
| 846 Filter ID: 0x21 |
| 847 Size of Filter Properties: 1 byte |
| 848 Changes size of data: Yes |
| 849 Allow as a non-last filter: No |
| 850 Allow as the last filter: Yes |
| 851 |
| 852 Preferred alignment: |
| 853 Input data: Adjustable to 1/2/4/8/16 byte(s) |
| 854 Output data: 1 byte |
| 855 |
| 856 The format of the one-byte Filter Properties field is as |
| 857 follows: |
| 858 |
| 859 Bits Mask Description |
| 860 0-5 0x3F Dictionary Size |
| 861 6-7 0xC0 Reserved for future use; MUST be zero for now. |
| 862 |
| 863 Dictionary Size is encoded with one-bit mantissa and five-bit |
| 864 exponent. The smallest dictionary size is 4 KiB and the biggest |
| 865 is 4 GiB. |
| 866 |
| 867 Raw value Mantissa Exponent Dictionary size |
| 868 0 2 11 4 KiB |
| 869 1 3 11 6 KiB |
| 870 2 2 12 8 KiB |
| 871 3 3 12 12 KiB |
| 872 4 2 13 16 KiB |
| 873 5 3 13 24 KiB |
| 874 6 2 14 32 KiB |
| 875 ... ... ... ... |
| 876 35 3 27 768 MiB |
| 877 36 2 28 1024 MiB |
| 878 37 3 29 1536 MiB |
| 879 38 2 30 2048 MiB |
| 880 39 3 30 3072 MiB |
| 881 40 2 31 4096 MiB - 1 B |
| 882 |
| 883 Instead of having a table in the decoder, the dictionary size |
| 884 can be decoded using the following C code: |
| 885 |
| 886 const uint8_t bits = get_dictionary_flags() & 0x3F; |
| 887 if (bits > 40) |
| 888 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB |
| 889 |
| 890 uint32_t dictionary_size; |
| 891 if (bits == 40) { |
| 892 dictionary_size = UINT32_MAX; |
| 893 } else { |
| 894 dictionary_size = 2 | (bits & 1); |
| 895 dictionary_size <<= bits / 2 + 11; |
| 896 } |
| 897 |
| 898 |
| 899 5.3.2. Branch/Call/Jump Filters for Executables |
| 900 |
| 901 These filters convert relative branch, call, and jump |
| 902 instructions to their absolute counterparts in executable |
| 903 files. This conversion increases redundancy and thus |
| 904 compression ratio. |
| 905 |
| 906 Size of Filter Properties: 0 or 4 bytes |
| 907 Changes size of data: No |
| 908 Allow as a non-last filter: Yes |
| 909 Allow as the last filter: No |
| 910 |
| 911 Below is the list of filters in this category. The alignment |
| 912 is the same for both input and output data. |
| 913 |
| 914 Filter ID Alignment Description |
| 915 0x04 1 byte x86 filter (BCJ) |
| 916 0x05 4 bytes PowerPC (big endian) filter |
| 917 0x06 16 bytes IA64 filter |
| 918 0x07 4 bytes ARM (little endian) filter |
| 919 0x08 2 bytes ARM Thumb (little endian) filter |
| 920 0x09 4 bytes SPARC filter |
| 921 |
| 922 If the size of Filter Properties is four bytes, the Filter |
| 923 Properties field contains the start offset used for address |
| 924 conversions. It is stored as an unsigned 32-bit little endian |
| 925 integer. The start offset MUST be a multiple of the alignment |
| 926 of the filter as listed in the table above; if it isn't, the |
| 927 decoder MUST indicate an error. If the size of Filter |
| 928 Properties is zero, the start offset is zero. |
| 929 |
| 930 Setting the start offset may be useful if an executable has |
| 931 multiple sections, and there are many cross-section calls. |
| 932 Taking advantage of this feature usually requires usage of |
| 933 the Subblock filter, whose design is not complete yet. |
| 934 |
| 935 |
| 936 5.3.3. Delta |
| 937 |
| 938 The Delta filter may increase compression ratio when the value |
| 939 of the next byte correlates with the value of an earlier byte |
| 940 at specified distance. |
| 941 |
| 942 Filter ID: 0x03 |
| 943 Size of Filter Properties: 1 byte |
| 944 Changes size of data: No |
| 945 Allow as a non-last filter: Yes |
| 946 Allow as the last filter: No |
| 947 |
| 948 Preferred alignment: |
| 949 Input data: 1 byte |
| 950 Output data: Same as the original input data |
| 951 |
| 952 The Properties byte indicates the delta distance, which can be |
| 953 1-256 bytes backwards from the current byte: 0x00 indicates |
| 954 distance of 1 byte and 0xFF distance of 256 bytes. |
| 955 |
| 956 |
| 957 5.3.3.1. Format of the Encoded Output |
| 958 |
| 959 The code below illustrates both encoding and decoding with |
| 960 the Delta filter. |
| 961 |
| 962 // Distance is in the range [1, 256]. |
| 963 const unsigned int distance = get_properties_byte() + 1; |
| 964 uint8_t pos = 0; |
| 965 uint8_t delta[256]; |
| 966 |
| 967 memset(delta, 0, sizeof(delta)); |
| 968 |
| 969 while (1) { |
| 970 const int byte = read_byte(); |
| 971 if (byte == EOF) |
| 972 break; |
| 973 |
| 974 uint8_t tmp = delta[(uint8_t)(distance + pos)]; |
| 975 if (is_encoder) { |
| 976 tmp = (uint8_t)(byte) - tmp; |
| 977 delta[pos] = (uint8_t)(byte); |
| 978 } else { |
| 979 tmp = (uint8_t)(byte) + tmp; |
| 980 delta[pos] = tmp; |
| 981 } |
| 982 |
| 983 write_byte(tmp); |
| 984 --pos; |
| 985 } |
| 986 |
| 987 |
| 988 5.4. Custom Filter IDs |
| 989 |
| 990 If a developer wants to use custom Filter IDs, he has two |
| 991 choices. The first choice is to contact Lasse Collin and ask |
| 992 him to allocate a range of IDs for the developer. |
| 993 |
| 994 The second choice is to generate a 40-bit random integer, |
| 995 which the developer can use as his personal Developer ID. |
| 996 To minimize the risk of collisions, Developer ID has to be |
| 997 a randomly generated integer, not manually selected "hex word". |
| 998 The following command, which works on many free operating |
| 999 systems, can be used to generate Developer ID: |
| 1000 |
| 1001 dd if=/dev/urandom bs=5 count=1 | hexdump |
| 1002 |
| 1003 The developer can then use his Developer ID to create unique |
| 1004 (well, hopefully unique) Filter IDs. |
| 1005 |
| 1006 Bits Mask Description |
| 1007 0-15 0x0000_0000_0000_FFFF Filter ID |
| 1008 16-55 0x00FF_FFFF_FFFF_0000 Developer ID |
| 1009 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F |
| 1010 |
| 1011 The resulting 63-bit integer will use 9 bytes of space when |
| 1012 stored using the encoding described in Section 1.2. To get |
| 1013 a shorter ID, see the beginning of this Section how to |
| 1014 request a custom ID range. |
| 1015 |
| 1016 |
| 1017 5.4.1. Reserved Custom Filter ID Ranges |
| 1018 |
| 1019 Range Description |
| 1020 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility |
| 1021 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility |
| 1022 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility |
| 1023 |
| 1024 |
| 1025 6. Cyclic Redundancy Checks |
| 1026 |
| 1027 There are several incompatible variations to calculate CRC32 |
| 1028 and CRC64. For simplicity and clarity, complete examples are |
| 1029 provided to calculate the checks as they are used in this file |
| 1030 format. Implementations MAY use different code as long as it |
| 1031 gives identical results. |
| 1032 |
| 1033 The program below reads data from standard input, calculates |
| 1034 the CRC32 and CRC64 values, and prints the calculated values |
| 1035 as big endian hexadecimal strings to standard output. |
| 1036 |
| 1037 #include <stddef.h> |
| 1038 #include <inttypes.h> |
| 1039 #include <stdio.h> |
| 1040 |
| 1041 uint32_t crc32_table[256]; |
| 1042 uint64_t crc64_table[256]; |
| 1043 |
| 1044 void |
| 1045 init(void) |
| 1046 { |
| 1047 static const uint32_t poly32 = UINT32_C(0xEDB88320); |
| 1048 static const uint64_t poly64 |
| 1049 = UINT64_C(0xC96C5795D7870F42); |
| 1050 |
| 1051 for (size_t i = 0; i < 256; ++i) { |
| 1052 uint32_t crc32 = i; |
| 1053 uint64_t crc64 = i; |
| 1054 |
| 1055 for (size_t j = 0; j < 8; ++j) { |
| 1056 if (crc32 & 1) |
| 1057 crc32 = (crc32 >> 1) ^ poly32; |
| 1058 else |
| 1059 crc32 >>= 1; |
| 1060 |
| 1061 if (crc64 & 1) |
| 1062 crc64 = (crc64 >> 1) ^ poly64; |
| 1063 else |
| 1064 crc64 >>= 1; |
| 1065 } |
| 1066 |
| 1067 crc32_table[i] = crc32; |
| 1068 crc64_table[i] = crc64; |
| 1069 } |
| 1070 } |
| 1071 |
| 1072 uint32_t |
| 1073 crc32(const uint8_t *buf, size_t size, uint32_t crc) |
| 1074 { |
| 1075 crc = ~crc; |
| 1076 for (size_t i = 0; i < size; ++i) |
| 1077 crc = crc32_table[buf[i] ^ (crc & 0xFF)] |
| 1078 ^ (crc >> 8); |
| 1079 return ~crc; |
| 1080 } |
| 1081 |
| 1082 uint64_t |
| 1083 crc64(const uint8_t *buf, size_t size, uint64_t crc) |
| 1084 { |
| 1085 crc = ~crc; |
| 1086 for (size_t i = 0; i < size; ++i) |
| 1087 crc = crc64_table[buf[i] ^ (crc & 0xFF)] |
| 1088 ^ (crc >> 8); |
| 1089 return ~crc; |
| 1090 } |
| 1091 |
| 1092 int |
| 1093 main() |
| 1094 { |
| 1095 init(); |
| 1096 |
| 1097 uint32_t value32 = 0; |
| 1098 uint64_t value64 = 0; |
| 1099 uint64_t total_size = 0; |
| 1100 uint8_t buf[8192]; |
| 1101 |
| 1102 while (1) { |
| 1103 const size_t buf_size |
| 1104 = fread(buf, 1, sizeof(buf), stdin); |
| 1105 if (buf_size == 0) |
| 1106 break; |
| 1107 |
| 1108 total_size += buf_size; |
| 1109 value32 = crc32(buf, buf_size, value32); |
| 1110 value64 = crc64(buf, buf_size, value64); |
| 1111 } |
| 1112 |
| 1113 printf("Bytes: %" PRIu64 "\n", total_size); |
| 1114 printf("CRC-32: 0x%08" PRIX32 "\n", value32); |
| 1115 printf("CRC-64: 0x%016" PRIX64 "\n", value64); |
| 1116 |
| 1117 return 0; |
| 1118 } |
| 1119 |
| 1120 |
| 1121 7. References |
| 1122 |
| 1123 LZMA SDK - The original LZMA implementation |
| 1124 http://7-zip.org/sdk.html |
| 1125 |
| 1126 LZMA Utils - LZMA adapted to POSIX-like systems |
| 1127 http://tukaani.org/lzma/ |
| 1128 |
| 1129 XZ Utils - The next generation of LZMA Utils |
| 1130 http://tukaani.org/xz/ |
| 1131 |
| 1132 [RFC-1952] |
| 1133 GZIP file format specification version 4.3 |
| 1134 http://www.ietf.org/rfc/rfc1952.txt |
| 1135 - Notation of byte boxes in section "2.1. Overall conventions" |
| 1136 |
| 1137 [RFC-2119] |
| 1138 Key words for use in RFCs to Indicate Requirement Levels |
| 1139 http://www.ietf.org/rfc/rfc2119.txt |
| 1140 |
| 1141 [GNU-tar] |
| 1142 GNU tar 1.21 manual |
| 1143 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html |
| 1144 - Node 9.4.2 "Blocking Factor", paragraph that begins |
| 1145 "gzip will complain about trailing garbage" |
| 1146 - Note that this URL points to the latest version of the |
| 1147 manual, and may some day not contain the note which is in |
| 1148 1.21. For the exact version of the manual, download GNU |
| 1149 tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz |
| 1150 |
OLD | NEW |