xz/doc/xz-file-format.txt - Issue 2869016: Add an unpatched version of xz, XZ Utils, to /trunk/deps/third_party

Side by Side Diff: xz/doc/xz-file-format.txt

Issue 2869016: Add an unpatched version of xz, XZ Utils, to /trunk/deps/third_party (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/deps/third_party/

Patch Set: Created 10 years, 6 months ago

Property Changes:

Added: svn:eol-style
+ LF

OLD	NEW
(Empty)
	1

	2 The .xz File Format

	3 ===================

	4

	5 Version 1.0.4 (2009-08-27)

	6

	7

	8 0. Preface

	9 0.1. Notices and Acknowledgements

	10 0.2. Getting the Latest Version

	11 0.3. Version History

	12 1. Conventions

	13 1.1. Byte and Its Representation

	14 1.2. Multibyte Integers

	15 2. Overall Structure of .xz File

	16 2.1. Stream

	17 2.1.1. Stream Header

	18 2.1.1.1. Header Magic Bytes

	19 2.1.1.2. Stream Flags

	20 2.1.1.3. CRC32

	21 2.1.2. Stream Footer

	22 2.1.2.1. CRC32

	23 2.1.2.2. Backward Size

	24 2.1.2.3. Stream Flags

	25 2.1.2.4. Footer Magic Bytes

	26 2.2. Stream Padding

	27 3. Block

	28 3.1. Block Header

	29 3.1.1. Block Header Size

	30 3.1.2. Block Flags

	31 3.1.3. Compressed Size

	32 3.1.4. Uncompressed Size

	33 3.1.5. List of Filter Flags

	34 3.1.6. Header Padding

	35 3.1.7. CRC32

	36 3.2. Compressed Data

	37 3.3. Block Padding

	38 3.4. Check

	39 4. Index

	40 4.1. Index Indicator

	41 4.2. Number of Records

	42 4.3. List of Records

	43 4.3.1. Unpadded Size

	44 4.3.2. Uncompressed Size

	45 4.4. Index Padding

	46 4.5. CRC32

	47 5. Filter Chains

	48 5.1. Alignment

	49 5.2. Security

	50 5.3. Filters

	51 5.3.1. LZMA2

	52 5.3.2. Branch/Call/Jump Filters for Executables

	53 5.3.3. Delta

	54 5.3.3.1. Format of the Encoded Output

	55 5.4. Custom Filter IDs

	56 5.4.1. Reserved Custom Filter ID Ranges

	57 6. Cyclic Redundancy Checks

	58 7. References

	59

	60

	61 0. Preface

	62

	63 This document describes the .xz file format (filename suffix

	64 ".xz", MIME type "application/x-xz"). It is intended that this

	65 format replace the old .lzma format used by LZMA SDK and

	66 LZMA Utils.

	67

	68

	69 0.1. Notices and Acknowledgements

	70

	71 This file format was designed by Lasse Collin

	72 <lasse.collin@tukaani.org> and Igor Pavlov.

	73

	74 Special thanks for helping with this document go to

	75 Ville Koskinen. Thanks for helping with this document also go to

	76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

	77

	78 This document has been put into the public domain.

	79

	80

	81 0.2. Getting the Latest Version

	82

	83 The latest official version of this document can be downloaded

	84 from <http://tukaani.org/xz/xz-file-format.txt>.

	85

	86 Specific versions of this document have a filename

	87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.

	88 For example, the version 1.0.0 of this document is available

	89 at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.

	90

	91

	92 0.3. Version History

	93

	94     Version   Date          Description

	95

	96     1.0.4     2009-08-27    Language improvements in Sections 1.2,

	97                             2.1.1.2, 3.1.1, 3.1.2, and 5.3.1

	98

	99     1.0.3     2009-06-05    Spelling fixes in Sections 5.1 and 5.4

	100

	101     1.0.2     2009-06-04    Typo fixes in Sections 4 and 5.3.1

	102

	103     1.0.1     2009-06-01    Typo fix in Section 0.3 and minor

	104                             clarifications to Sections 2, 2.2,

	105                             3.3, 4.4, and 5.3.2

	106

	107     1.0.0     2009-01-14    The first official version

	108

	109

	110 1. Conventions

	111

	112 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",

	113 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this

	114 document are to be interpreted as described in [RFC-2119].

	115

	116 Indicating a warning means displaying a message, returning

	117 appropriate exit status, or doing something else to let the

	118 user know that something worth warning occurred. The operation

	119 SHOULD still finish if a warning is indicated.

	120

	121 Indicating an error means displaying a message, returning

	122 appropriate exit status, or doing something else to let the

	123 user know that something prevented successfully finishing the

	124 operation. The operation MUST be aborted once an error has

	125 been indicated.

	126

	127

	128 1.1. Byte and Its Representation

	129

	130 In this document, a byte is always 8 bits.

	131

	132 A "null byte" has all bits unset. That is, the value of a null

	133 byte is 0x00.

	134

	135 To represent byte blocks, this document uses notation that

	136 is similar to the notation used in [RFC-1952]:

	137

	138     +-------+

	139     |  Foo  |   One byte.

	140     +-------+

	141

	142     +---+---+

	143     |  Foo  |   Two bytes; that is, some of the vertical bars

	144     +---+---+   can be missing.

	145

	146     +=======+

	147     |  Foo  |   Zero or more bytes.

	148     +=======+

	149

	150 In this document, a boxed byte or a byte sequence declared

	151 using this notation is called "a field". The example field

	152 above would be called "the Foo field" or plain "Foo".

	153

	154 If there are many fields, they may be split to multiple lines.

	155 This is indicated with an arrow ("--->"):

	156

	157     +=====+

	158     | Foo |

	159     +=====+

	160

	161          +=====+

	162     ---> | Bar |

	163          +=====+

	164

	165 The above is equivalent to this:

	166

	167     +=====+=====+

	168     | Foo | Bar |

	169     +=====+=====+

	170

	171

	172 1.2. Multibyte Integers

	173

	174 Multibyte integers of static length, such as CRC values,

	175 are stored in little endian byte order (least significant

	176 byte first).

	177

	178 When smaller values are more likely than bigger values (for

	179 example file sizes), multibyte integers are encoded in a

	180 variable-length representation:

	181 - Numbers in the range [0, 127] are copied as is, and take

	182 one byte of space.

	183 - Bigger numbers will occupy two or more bytes. All but the

	184 last byte of the multibyte representation have the highest

	185 (eighth) bit set.

	186

	187 For now, the value of the variable-length integers is limited

	188 to 63 bits, which limits the encoded size of the integer to

	189 nine bytes. These limits may be increased in the future if

	190 needed.

	191

	192 The following C code illustrates encoding and decoding of

	193 variable-length integers. The functions return the number of

	194 bytes occupied by the integer (1-9), or zero on error.

	195

	196     #include <stddef.h>

	197     #include <inttypes.h>

	198

	199     size_t

	200     encode(uint8_t buf[static 9], uint64_t num)

	201     {

	202         if (num > UINT64_MAX / 2)   /* does not fit in 63 bits */

	203             return 0;

	204

	205         size_t i = 0;

	206

	207         while (num >= 0x80) {

	208             buf[i++] = (uint8_t)(num) | 0x80;

	209             num >>= 7;

	210         }

	211

	212         buf[i++] = (uint8_t)(num);

	213

	214         return i;

	215     }

	216

	217     size_t

	218     decode(const uint8_t buf[], size_t size_max, uint64_t *num)

	219     {

	220         if (size_max == 0)

	221             return 0;

	222

	223         if (size_max > 9)

	224             size_max = 9;

	225

	226         *num = buf[0] & 0x7F;

	227         size_t i = 0;

	228

	229         while (buf[i++] & 0x80) {

	230             if (i >= size_max || buf[i] == 0x00)

	231                 return 0;

	232

	233             *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);

	234         }

	235

	236         return i;

	237     }

	238
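As a small illustration of the encoding described in Section 1.2, the following sketch computes how many bytes a value occupies in the variable-length representation. The names vli_encoded_size and vli_encoded_size_examples are hypothetical and not part of the format; the sketch only assumes the 63-bit limit and the seven payload bits per byte stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the specification): returns the
 * number of bytes (1-9) that num occupies in the variable-length
 * representation, or zero if num does not fit in 63 bits. */
size_t
vli_encoded_size(uint64_t num)
{
    if (num > UINT64_MAX / 2)
        return 0;

    size_t size = 1;

    /* Every byte carries seven bits of the value. */
    while (num >= 0x80) {
        num >>= 7;
        ++size;
    }

    return size;
}

/* Worked examples of the size calculation. */
void
vli_encoded_size_examples(void)
{
    assert(vli_encoded_size(127) == 1);             /* 0x7F */
    assert(vli_encoded_size(128) == 2);             /* 0x80 0x01 */
    assert(vli_encoded_size(300) == 2);             /* 0xAC 0x02 */
    assert(vli_encoded_size(UINT64_MAX / 2) == 9);  /* largest value */
    assert(vli_encoded_size(UINT64_MAX) == 0);      /* out of range */
}
```

A helper like this can be handy for sizing a buffer before encoding, for example when building the Index.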

	239

	240 2. Overall Structure of .xz File

	241

	242 A standalone .xz file consists of one or more Streams which may

	243 have Stream Padding between or after them:

	244

	245     +========+================+========+================+

	246     | Stream | Stream Padding | Stream | Stream Padding | ...

	247     +========+================+========+================+

	248

	249 The sizes of Stream and Stream Padding are always multiples

	250 of four bytes, thus the size of every valid .xz file MUST be

	251 a multiple of four bytes.

	252

	253 While a typical file contains only one Stream and no Stream

	254 Padding, a decoder handling standalone .xz files SHOULD support

	255 files that have more than one Stream or Stream Padding.

	256

	257 In contrast to standalone .xz files, when the .xz file format

	258 is used as an internal part of some other file format or

	259 communication protocol, it usually is expected that the decoder

	260 stops after the first Stream, and doesn't look for Stream

	261 Padding or possibly other Streams.

	262

	263

	264 2.1. Stream

	265

	266     +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

	267     |     Stream Header     | Block | Block | ... | Block |

	268     +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+     +=======+

	269

	270          +=======+-+-+-+-+-+-+-+-+-+-+-+-+

	271     ---> | Index |     Stream Footer     |

	272          +=======+-+-+-+-+-+-+-+-+-+-+-+-+

	273

	274 All the above fields have a size that is a multiple of four. If

	275 Stream is used as an internal part of another file format, it

	276 is RECOMMENDED to make the Stream start at an offset that is

	277 a multiple of four bytes.

	278

	279 Stream Header, Index, and Stream Footer are always present in

	280 a Stream. The maximum size of the Index field is 16 GiB (2^34).

	281

	282 There are zero or more Blocks. The maximum number of Blocks is

	283 limited only by the maximum size of the Index field.

	284

	285 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).

	286 The same limit applies to the total amount of uncompressed

	287 data stored in a Stream.

	288

	289 If an implementation supports handling .xz files with multiple

	290 concatenated Streams, it MAY apply the above limits to the file

	291 as a whole instead of applying them to each Stream separately.

	292

	293

	294 2.1.1. Stream Header

	295

	296     +---+---+---+---+---+---+-------+------+--+--+--+--+

	297     |  Header Magic Bytes   | Stream Flags |   CRC32   |

	298     +---+---+---+---+---+---+-------+------+--+--+--+--+

	299

	300

	301 2.1.1.1. Header Magic Bytes

	302

	303 The first six (6) bytes of the Stream are the so-called Header

	304 Magic Bytes. They can be used to identify the file type.

	305

	306 Using a C array and ASCII:

	307     const uint8_t HEADER_MAGIC[6]

	308             = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

	309

	310 In plain hexadecimal:

	311     FD 37 7A 58 5A 00

	312

	313 Notes:

	314 - The first byte (0xFD) was chosen so that the files cannot

	315 be erroneously detected as being in .lzma format, in which

	316 the first byte is in the range [0x00, 0xE0].

	317 - The sixth byte (0x00) was chosen to prevent applications

	318 from misdetecting the file as a text file.

	319

	320 If the Header Magic Bytes don't match, the decoder MUST

	321 indicate an error.

	322
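The check described above can be sketched as follows. The function name has_header_magic is hypothetical; the byte values are the ones given in this section. A decoder would indicate an error whenever the function returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const uint8_t HEADER_MAGIC[6]
        = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

/* Hypothetical helper: returns true if the first six bytes of buf
 * are the Header Magic Bytes. Inputs shorter than six bytes cannot
 * match. */
bool
has_header_magic(const uint8_t *buf, size_t size)
{
    assert(buf != NULL);

    return size >= sizeof(HEADER_MAGIC)
            && memcmp(buf, HEADER_MAGIC, sizeof(HEADER_MAGIC)) == 0;
}
```

Because the first byte of a .lzma file is at most 0xE0, such a file can never pass this test.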

	323

	324 2.1.1.2. Stream Flags

	325

	326 The first byte of Stream Flags is always a null byte. In the

	327 future, this byte may be used to indicate a new Stream version

	328 or other Stream properties.

	329

	330 The second byte of Stream Flags is a bit field:

	331

	332     Bit(s)  Mask  Description

	333      0-3    0x0F  Type of Check (see Section 3.4):

	334                       ID    Size      Check name

	335                       0x00   0 bytes  None

	336                       0x01   4 bytes  CRC32

	337                       0x02   4 bytes  (Reserved)

	338                       0x03   4 bytes  (Reserved)

	339                       0x04   8 bytes  CRC64

	340                       0x05   8 bytes  (Reserved)

	341                       0x06   8 bytes  (Reserved)

	342                       0x07  16 bytes  (Reserved)

	343                       0x08  16 bytes  (Reserved)

	344                       0x09  16 bytes  (Reserved)

	345                       0x0A  32 bytes  SHA-256

	346                       0x0B  32 bytes  (Reserved)

	347                       0x0C  32 bytes  (Reserved)

	348                       0x0D  64 bytes  (Reserved)

	349                       0x0E  64 bytes  (Reserved)

	350                       0x0F  64 bytes  (Reserved)

	351      4-7    0xF0  Reserved for future use; MUST be zero for now.

	352

	353 Implementations SHOULD support at least the Check IDs 0x00

	354 (None) and 0x01 (CRC32). Supporting other Check IDs is

	355 OPTIONAL. If an unsupported Check is used, the decoder SHOULD

	356 indicate a warning or error.

	357

	358 If any reserved bit is set, the decoder MUST indicate an error.

	359 It is possible that there is a new field present which the

	360 decoder is not aware of, and can thus parse the Stream Header

	361 incorrectly.

	362
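A minimal sketch of parsing the Stream Flags field, under the rules of Section 2.1.1.2, could look like this. The names CHECK_SIZES and parse_stream_flags are hypothetical; the sizes are copied from the table above, so the size of the Check field is known even for reserved Check IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the Check field in bytes, indexed by Check ID (0x00-0x0F). */
static const uint8_t CHECK_SIZES[16] = {
    0, 4, 4, 4, 8, 8, 8, 16, 16, 16, 32, 32, 32, 64, 64, 64
};

/* Hypothetical helper: parses the two-byte Stream Flags field.
 * Returns false if the first byte is not a null byte or if a
 * reserved bit is set; the decoder has to indicate an error then.
 * On success the Check ID and the size of the Check are stored. */
bool
parse_stream_flags(const uint8_t flags[2],
        uint8_t *check_id, size_t *check_size)
{
    assert(flags != NULL && check_id != NULL && check_size != NULL);

    if (flags[0] != 0x00 || (flags[1] & 0xF0) != 0)
        return false;

    *check_id = flags[1] & 0x0F;
    *check_size = CHECK_SIZES[*check_id];
    return true;
}
```

Whether a given Check ID is actually supported is a separate decision that the caller makes after parsing.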

	363

	364 2.1.1.3. CRC32

	365

	366 The CRC32 is calculated from the Stream Flags field. It is

	367 stored as an unsigned 32-bit little endian integer. If the

	368 calculated value does not match the stored one, the decoder

	369 MUST indicate an error.

	370

	371 The idea is that Stream Flags would always be two bytes, even

	372 if new features are needed. This way old decoders will be able

	373 to verify the CRC32 calculated from Stream Flags, and thus

	374 distinguish between corrupt files (CRC32 doesn't match) and

	375 files that the decoder doesn't support (CRC32 matches but

	376 Stream Flags has reserved bits set).

	377
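The following sketch shows how the CRC32 of the Stream Header might be verified. It assumes that the CRC32 used by this format is the common IEEE 802.3 variant (reflected polynomial 0xEDB88320); the details are defined in Section 6, which is not reproduced here. The names crc32_ieee and stream_header_crc_ok are hypothetical, and the bitwise loop favors brevity over speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32, assuming the IEEE 802.3 polynomial in reflected
 * form. A table-driven version would be faster. */
uint32_t
crc32_ieee(const uint8_t *buf, size_t size)
{
    assert(buf != NULL);

    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < size; ++i) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }

    return ~crc;
}

/* Hypothetical helper: checks the CRC32 field of a 12-byte Stream
 * Header. The CRC32 covers only the two Stream Flags bytes at
 * offsets 6 and 7 and is stored in little endian byte order. */
bool
stream_header_crc_ok(const uint8_t header[12])
{
    assert(header != NULL);

    const uint32_t stored = (uint32_t)header[8]
            | ((uint32_t)header[9] << 8)
            | ((uint32_t)header[10] << 16)
            | ((uint32_t)header[11] << 24);

    return crc32_ieee(header + 6, 2) == stored;
}
```

If the CRC32 matches but Stream Flags has reserved bits set, the file is unsupported rather than corrupt, which is the distinction described above.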

	378

	379 2.1.2. Stream Footer

	380

	381     +-+-+-+-+---+---+---+---+-------+------+----------+---------+

	382     | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |

	383     +-+-+-+-+---+---+---+---+-------+------+----------+---------+

	384

	385

	386 2.1.2.1. CRC32

	387

	388 The CRC32 is calculated from the Backward Size and Stream Flags

	389 fields. It is stored as an unsigned 32-bit little endian

	390 integer. If the calculated value does not match the stored one,

	391 the decoder MUST indicate an error.

	392

	393 The reason to have the CRC32 field before the Backward Size and

	394 Stream Flags fields is to keep the four-byte fields aligned to

	395 a multiple of four bytes.

	396

	397

	398 2.1.2.2. Backward Size

	399

	400 Backward Size is stored as a 32-bit little endian integer,

	401 which indicates the size of the Index field as multiple of

	402 four bytes, minimum value being four bytes:

	403

	404 real_backward_size = (stored_backward_size + 1) * 4;

	405

	406 If the stored value does not match the real size of the Index

	407 field, the decoder MUST indicate an error.

	408

	409 Using a fixed-size integer to store Backward Size makes

	410 it slightly simpler to parse the Stream Footer when the

	411 application needs to parse the Stream backwards.

	412
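As a sketch of backward parsing, the size of the Index can be read from the last twelve bytes of a Stream as shown below. The name footer_index_size is hypothetical. Only the Footer Magic Bytes are verified here; a real decoder would also check the CRC32 and compare Stream Flags with the Stream Header. The stored value is widened to 64 bits before adding one, because the largest stored value corresponds to 2^34 bytes, which does not fit in 32 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: extracts the real size of the Index field
 * from a 12-byte Stream Footer laid out as
 * CRC32 (4) | Backward Size (4) | Stream Flags (2) | "YZ" (2). */
bool
footer_index_size(const uint8_t footer[12], uint64_t *index_size)
{
    assert(footer != NULL && index_size != NULL);

    if (footer[10] != 'Y' || footer[11] != 'Z')
        return false;

    /* Backward Size is a 32-bit little endian integer. */
    const uint32_t stored = (uint32_t)footer[4]
            | ((uint32_t)footer[5] << 8)
            | ((uint32_t)footer[6] << 16)
            | ((uint32_t)footer[7] << 24);

    /* real_backward_size = (stored_backward_size + 1) * 4 */
    *index_size = ((uint64_t)stored + 1) * 4;
    return true;
}
```

With the returned size, an application can seek backwards from the start of the Stream Footer to the start of the Index.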

	413

	414 2.1.2.3. Stream Flags

	415

	416 This is a copy of the Stream Flags field from the Stream

	417 Header. The information stored to Stream Flags is needed

	418 when parsing the Stream backwards. The decoder MUST compare

	419 the Stream Flags fields in both Stream Header and Stream

	420 Footer, and indicate an error if they are not identical.

	421

	422

	423 2.1.2.4. Footer Magic Bytes

	424

	425 As the last step of the decoding process, the decoder MUST

	426 verify the existence of Footer Magic Bytes. If they don't

	427 match, an error MUST be indicated.

	428

	429 Using a C array and ASCII:

	430 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

	431

	432 In hexadecimal:

	433 59 5A

	434

	435 The primary reason to have Footer Magic Bytes is to make

	436 it easier to detect incomplete files quickly, without

	437 uncompressing. If the file does not end with Footer Magic Bytes

	438 (excluding Stream Padding described in Section 2.2), it cannot

	439 be undamaged, unless someone has intentionally appended garbage

	440 after the end of the Stream.

	441

	442

	443 2.2. Stream Padding

	444

	445 Only the decoders that support decoding of concatenated Streams

	446 MUST support Stream Padding.

	447

	448 Stream Padding MUST contain only null bytes. To preserve the

	449 four-byte alignment of consecutive Streams, the size of Stream

	450 Padding MUST be a multiple of four bytes. Empty Stream Padding

	451 is allowed. If these requirements are not met, the decoder MUST

	452 indicate an error.

	453

	454 Note that non-empty Stream Padding is allowed at the end of the

	455 file; there doesn't need to be a new Stream after non-empty

	456 Stream Padding. This can be convenient in certain situations

	457 [GNU-tar].

	458

	459 The possibility of Stream Padding MUST be taken into account

	460 when designing an application that parses Streams backwards,

	461 and the application supports concatenated Streams.

	462

	463

	464 3. Block

	465

	466     +==============+=================+===============+=======+

	467     | Block Header | Compressed Data | Block Padding | Check |

	468     +==============+=================+===============+=======+

	469

	470

	471 3.1. Block Header

	472

	473     +-------------------+-------------+=================+

	474     | Block Header Size | Block Flags | Compressed Size |

	475     +-------------------+-------------+=================+

	476

	477          +===================+======================+

	478     ---> | Uncompressed Size | List of Filter Flags |

	479          +===================+======================+

	480

	481          +================+--+--+--+--+

	482     ---> | Header Padding |   CRC32   |

	483          +================+--+--+--+--+

	484

	485

	486 3.1.1. Block Header Size

	487

	488 This field overlaps with the Index Indicator field (see

	489 Section 4.1).

	490

	491 This field contains the size of the Block Header field,

	492 including the Block Header Size field itself. Valid values are

	493 in the range [0x01, 0xFF], which indicate the size of the Block

	494 Header as multiples of four bytes, minimum size being eight

	495 bytes:

	496

	497 real_header_size = (encoded_header_size + 1) * 4;

	498

	499 If a Block Header bigger than 1024 bytes is needed in the

	500 future, a new field can be added between the Block Header and

	501 Compressed Data fields. The presence of this new field would

	502 be indicated in the Block Header field.

	503
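The size calculation above, together with the overlap with the Index Indicator, can be sketched as follows. The function names are hypothetical. A return value of zero signals that the byte is 0x00, that is, the Index Indicator rather than a Block Header Size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts the encoded Block Header Size into
 * the real size in bytes. Returns zero for 0x00, which marks the
 * beginning of the Index instead of a Block Header. */
size_t
block_header_real_size(uint8_t encoded_header_size)
{
    if (encoded_header_size == 0x00)
        return 0;

    return ((size_t)encoded_header_size + 1) * 4;
}

/* Worked examples for the two ends of the valid range. */
void
block_header_real_size_examples(void)
{
    assert(block_header_real_size(0x01) == 8);      /* minimum */
    assert(block_header_real_size(0xFF) == 1024);   /* maximum */
}
```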

	504

	505 3.1.2. Block Flags

	506

	507 The Block Flags field is a bit field:

	508

	509     Bit(s)  Mask  Description

	510      0-1    0x03  Number of filters (1-4)

	511      2-5    0x3C  Reserved for future use; MUST be zero for now.

	512       6     0x40  The Compressed Size field is present.

	513       7     0x80  The Uncompressed Size field is present.

	514

	515 If any reserved bit is set, the decoder MUST indicate an error.

	516 It is possible that there is a new field present which the

	517 decoder is not aware of, and can thus parse the Block Header

	518 incorrectly.

	519
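A sketch of decoding the Block Flags field is given below. The struct and function names are hypothetical. The sketch assumes that the two-bit value in bits 0-1 stores the number of filters minus one, which is the natural reading of the range 1-4 in the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the decoded Block Flags. */
struct block_flags {
    unsigned filter_count;          /* 1-4 */
    bool has_compressed_size;       /* bit 6 */
    bool has_uncompressed_size;     /* bit 7 */
};

/* Hypothetical helper: decodes Block Flags. Returns false if any
 * reserved bit (mask 0x3C) is set; the decoder has to indicate an
 * error in that case. */
bool
parse_block_flags(uint8_t flags, struct block_flags *out)
{
    assert(out != NULL);

    if ((flags & 0x3C) != 0)
        return false;

    /* Assumption: the stored value is the filter count minus one. */
    out->filter_count = (flags & 0x03u) + 1u;
    out->has_compressed_size = (flags & 0x40) != 0;
    out->has_uncompressed_size = (flags & 0x80) != 0;
    return true;
}
```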

	520

	521 3.1.3. Compressed Size

	522

	523 This field is present only if the appropriate bit is set in

	524 the Block Flags field (see Section 3.1.2).

	525

	526 The Compressed Size field contains the size of the Compressed

	527 Data field, which MUST be non-zero. Compressed Size is stored

	528 using the encoding described in Section 1.2. If the Compressed

	529 Size doesn't match the size of the Compressed Data field, the

	530 decoder MUST indicate an error.

	531

	532

	533 3.1.4. Uncompressed Size

	534

	535 This field is present only if the appropriate bit is set in

	536 the Block Flags field (see Section 3.1.2).

	537

	538 The Uncompressed Size field contains the size of the Block

	539 after uncompressing. Uncompressed Size is stored using the

	540 encoding described in Section 1.2. If the Uncompressed Size

	541 does not match the real uncompressed size, the decoder MUST

	542 indicate an error.

	543

	544 Storing the Compressed Size and Uncompressed Size fields serves

	545 several purposes:

	546 - The decoder knows how much memory it needs to allocate

	547 for a temporary buffer in multithreaded mode.

	548 - Simple error detection: wrong size indicates a broken file.

	549 - Seeking forwards to a specific location in streamed mode.

	550

	551 It should be noted that the only reliable way to determine

	552 the real uncompressed size is to uncompress the Block,

	553 because the Block Header and Index fields may contain

	554 (intentionally or unintentionally) invalid information.

	555

	556

	557 3.1.5. List of Filter Flags

	558

	559     +================+================+     +================+

	560     | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |

	561     +================+================+     +================+

	562

	563 The number of Filter Flags fields is stored in the Block Flags

	564 field (see Section 3.1.2).

	565

	566 The format of each Filter Flags field is as follows:

	567

	568     +===========+====================+===================+

	569     | Filter ID | Size of Properties | Filter Properties |

	570     +===========+====================+===================+

	571

	572 Both Filter ID and Size of Properties are stored using the

	573 encoding described in Section 1.2. Size of Properties indicates

	574 the size of the Filter Properties field as bytes. The list of

	575 officially defined Filter IDs and the formats of their Filter

	576 Properties are described in Section 5.3.

	577

	578 Filter IDs greater than or equal to 0x4000_0000_0000_0000

	579 (2^62) are reserved for implementation-specific internal use.

	580 These Filter IDs MUST never be used in List of Filter Flags.

	581

	582

	583 3.1.6. Header Padding

	584

	585 This field contains as many null bytes as are needed to make

	586 the Block Header have the size specified in Block Header Size.

	587 If any of the bytes are not null bytes, the decoder MUST

	588 indicate an error. It is possible that there is a new field

	589 present which the decoder is not aware of, and can thus parse

	590 the Block Header incorrectly.

	591

	592

	593 3.1.7. CRC32

	594

	595 The CRC32 is calculated over everything in the Block Header

	596 field except the CRC32 field itself. It is stored as an

	597 unsigned 32-bit little endian integer. If the calculated

	598 value does not match the stored one, the decoder MUST indicate

	599 an error.

	600

	601 Verifying the CRC32 of the Block Header before parsing the

	602 actual contents allows the decoder to distinguish between

	603 corrupt and unsupported files.

	604

	605

	606 3.2. Compressed Data

	607

	608 The format of Compressed Data depends on Block Flags and List

	609 of Filter Flags. Excluding the descriptions of the simplest

	610 filters in Section 5.3, the format of the filter-specific

	611 encoded data is out of scope of this document.

	612

	613

	614 3.3. Block Padding

	615

	616 Block Padding MUST contain 0-3 null bytes to make the size of

	617 the Block a multiple of four bytes. This can be needed when

	618 the size of Compressed Data is not a multiple of four. If any

	619 of the bytes in Block Padding are not null bytes, the decoder

	620 MUST indicate an error.

	621

	622

	623 3.4. Check

	624

	625 The type and size of the Check field depend on which bits

	626 are set in the Stream Flags field (see Section 2.1.1.2).

	627

	628 The Check, when used, is calculated from the original

	629 uncompressed data. If the calculated Check does not match the

	630 stored one, the decoder MUST indicate an error. If the selected

	631 type of Check is not supported by the decoder, it SHOULD

	632 indicate a warning or error.

	633

	634

	635 4. Index

	636

	637     +-----------------+===================+

	638     | Index Indicator | Number of Records |

	639     +-----------------+===================+

	640

	641          +=================+===============+-+-+-+-+

	642     ---> | List of Records | Index Padding | CRC32 |

	643          +=================+===============+-+-+-+-+

	644

	645 Index serves several purposes. Using it, one can

	646 - verify that all Blocks in a Stream have been processed;

	647 - find out the uncompressed size of a Stream; and

	648 - quickly access the beginning of any Block (random access).

	649

	650

	651 4.1. Index Indicator

	652

	653 This field overlaps with the Block Header Size field (see

	654 Section 3.1.1). The value of Index Indicator is always 0x00.

	655

	656

	657 4.2. Number of Records

	658

	659 This field indicates how many Records there are in the List

	660 of Records field, and thus how many Blocks there are in the

	661 Stream. The value is stored using the encoding described in

	662 Section 1.2. If the decoder has decoded all the Blocks of the

	663 Stream, and then notices that the Number of Records doesn't

	664 match the real number of Blocks, the decoder MUST indicate an

	665 error.

	666

	667

	668 4.3. List of Records

	669

	670 List of Records consists of as many Records as indicated by the

	671 Number of Records field:

	672

	673     +========+========+

	674     | Record | Record | ...

	675     +========+========+

	676

	677 Each Record contains information about one Block:

	678

	679     +===============+===================+

	680     | Unpadded Size | Uncompressed Size |

	681     +===============+===================+

	682

	683 If the decoder has decoded all the Blocks of the Stream, it

	684 MUST verify that the contents of the Records match the real

	685 Unpadded Size and Uncompressed Size of the respective Blocks.

	686

	687 Implementation hint: It is possible to verify the Index with

	688 constant memory usage by calculating for example SHA-256 of

	689 both the real size values and the List of Records, then

	690 comparing the hash values. Implementing this using a

	691 non-cryptographic hash like CRC32 SHOULD be avoided unless

	692 small code size is important.

	693

	694 If the decoder supports random-access reading, it MUST verify

	695 that Unpadded Size and Uncompressed Size of every completely

	696 decoded Block match the sizes stored in the Index. If only a

	697 partial Block is decoded, the decoder MUST verify that the

	698 processed sizes don't exceed the sizes stored in the Index.

	699

	700

	701 4.3.1. Unpadded Size

	702

	703 This field indicates the size of the Block excluding the Block

	704 Padding field. That is, Unpadded Size is the size of the Block

	705 Header, Compressed Data, and Check fields. Unpadded Size is

	706 stored using the encoding described in Section 1.2. The value

	707 MUST never be zero; with the current structure of Blocks, the

	708 actual minimum value for Unpadded Size is five.

	709

	710 Implementation note: Because the size of the Block Padding

	711 field is not included in Unpadded Size, calculating the total

	712 size of a Stream or doing random-access reading requires

	713 calculating the actual size of the Blocks by rounding Unpadded

	714 Sizes up to the next multiple of four.

	715
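The rounding mentioned in the implementation note can be sketched as follows. The function names are hypothetical. The sketch assumes that Unpadded Size is below 2^63, as guaranteed by the variable-length encoding, so the addition cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total size of a Block including Block
 * Padding, obtained by rounding Unpadded Size up to the next
 * multiple of four. */
uint64_t
block_total_size(uint64_t unpadded_size)
{
    return (unpadded_size + 3) & ~UINT64_C(3);
}

/* Hypothetical helper: number of Block Padding bytes (0-3). */
unsigned
block_padding_size(uint64_t unpadded_size)
{
    return (unsigned)(block_total_size(unpadded_size) - unpadded_size);
}

/* Worked examples. */
void
block_total_size_examples(void)
{
    assert(block_total_size(5) == 8);
    assert(block_total_size(8) == 8);
    assert(block_total_size(9) == 12);
    assert(block_padding_size(9) == 3);
}
```

Summing block_total_size over all Records gives the offset of each Block relative to the end of the Stream Header, which is what random-access reading needs.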

	716 The reason to exclude Block Padding from Unpadded Size is to

	717 ease making a raw copy of Compressed Data without Block

	718 Padding. This can be useful, for example, if someone wants

	719 to convert Streams to some other file format quickly.

	720

	721

	722 4.3.2. Uncompressed Size

	723

	724 This field indicates the Uncompressed Size of the respective

	725 Block as bytes. The value is stored using the encoding

	726 described in Section 1.2.

	727

	728

	729 4.4. Index Padding

	730

	731 This field MUST contain 0-3 null bytes to pad the Index to

	732 a multiple of four bytes. If any of the bytes are not null

	733 bytes, the decoder MUST indicate an error.

	734

	735

	736 4.5. CRC32

	737

	738 The CRC32 is calculated over everything in the Index field

	739 except the CRC32 field itself. The CRC32 is stored as an

	740 unsigned 32-bit little endian integer. If the calculated

	741 value does not match the stored one, the decoder MUST indicate

	742 an error.

	743

	744

	745 5. Filter Chains

	746

	747 The Block Flags field defines how many filters are used. When

	748 more than one filter is used, the filters are chained; that is,

	749 the output of one filter is the input of another filter. The

	750 following figure illustrates the direction of data flow.

	751

	752             v   Uncompressed Data   ^

	753             |       Filter 0        |

	754     Encoder |       Filter 1        | Decoder

	755             |       Filter n        |

	756             v    Compressed Data    ^

	757

	758

	759 5.1. Alignment

	760

	761 Alignment of uncompressed input data is usually the job of

	762 the application producing the data. For example, to get the

	763 best results, an archiver tool should make sure that all

	764 PowerPC executable files in the archive stream start at

	765 offsets that are multiples of four bytes.

	766

	767 Some filters, for example LZMA2, can be configured to take

	768 advantage of specified alignment of input data. Note that

	769 taking advantage of aligned input can be beneficial also when

	770 a filter is not the first filter in the chain. For example,

	771 if you compress PowerPC executables, you may want to use the

	772 PowerPC filter and chain that with the LZMA2 filter. Because

	773 not only the input but also the output alignment of the PowerPC

	774 filter is four bytes, it is now beneficial to set LZMA2

	775 settings so that the LZMA2 encoder can take advantage of its

	776 four-byte-aligned input data.

	777

	778 The output of the last filter in the chain is stored to the

	779 Compressed Data field, which is guaranteed to be aligned

	780 to a multiple of four bytes relative to the beginning of the

	781 Stream. This can increase

	782 - speed, if the filtered data is handled multiple bytes at

	783 a time by the filter-specific encoder and decoder,

	784 because accessing aligned data in computer memory is

	785 usually faster; and

	786 - compression ratio, if the output data is later compressed

	787 with an external compression tool.

	788

	789

	790 5.2. Security

	791

	792 If filters were allowed to be chained freely, it would be

	793 possible to create malicious files that would be very slow to

	794 decode. Such files could be used to create denial of service

	795 attacks.

	796

	797 Slow files could occur when multiple filters are chained:

	798

	799     v   Compressed input data

	800     |       Filter 1 decoder (last filter)

	801     |       Filter 0 decoder (non-last filter)

	802     v   Uncompressed output data

	803

	804 The decoder of the last filter in the chain produces a lot of

	805 output from little input. Another filter in the chain takes the

	806 output of the last filter, and produces very little output

	807 while consuming a lot of input. As a result, a lot of data is

	808 moved inside the filter chain, but the filter chain as a whole

	809 gets very little work done.

	810

	811 To prevent such slow files, there are restrictions on

	812 how the filters can be chained. These restrictions MUST be

	813 taken into account when designing new filters.

	814

	815 The maximum number of filters in the chain has been limited to

	816 four, thus there can be at most three non-last filters.

	817 Of these three non-last filters, only two are allowed to change

	818 the size of the data.

	819

	820 Non-last filters that change the size of the data MUST

	821 have a limit on how much the decoder can compress the data: the

	822 decoder SHOULD produce at least n bytes of output when the

	823 filter is given 2n bytes of input. This limit is not

	824 absolute, but significant deviations MUST be avoided.

	825

	826 The above limitations guarantee that if the last filter in the

	827 chain produces 4n bytes of output, the chain as a whole will

	828 produce at least n bytes of output.

	829

	830

	831 5.3. Filters

	832

	833 5.3.1. LZMA2

	834

	835 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose

	836 compression algorithm with high compression ratio and fast

	837 decompression. LZMA is based on LZ77 and range coding

	838 algorithms.

	839

	840 LZMA2 is an extension on top of the original LZMA. LZMA2 uses

	841 LZMA internally, but adds support for flushing the encoder and

	842 for uncompressed chunks, eases stateful decoder implementations,

	843 and improves support for multithreading. Thus, the plain LZMA

	844 will not be supported in this file format.

	845

	846 Filter ID: 0x21

	847 Size of Filter Properties: 1 byte

	848 Changes size of data: Yes

	849 Allow as a non-last filter: No

	850 Allow as the last filter: Yes

	851

	852 Preferred alignment:

	853 Input data: Adjustable to 1/2/4/8/16 byte(s)

	854 Output data: 1 byte

	855

	856 The format of the one-byte Filter Properties field is as

	857 follows:

	858

	859     Bits   Mask   Description

	860     0-5    0x3F   Dictionary Size

	861     6-7    0xC0   Reserved for future use; MUST be zero for now.

	862

	863 Dictionary Size is encoded with one-bit mantissa and five-bit

	864 exponent. The smallest dictionary size is 4 KiB and the biggest

	865 is 4 GiB.

	866

	867 Raw value Mantissa Exponent Dictionary size

	868 0 2 11 4 KiB

	869 1 3 11 6 KiB

	870 2 2 12 8 KiB

	871 3 3 12 12 KiB

	872 4 2 13 16 KiB

	873 5 3 13 24 KiB

	874 6 2 14 32 KiB

	875 ... ... ... ...

	876 35 3 28 768 MiB

	877 36 2 29 1024 MiB

	878 37 3 29 1536 MiB

	879 38 2 30 2048 MiB

	880 39 3 30 3072 MiB

	881 40 2 31 4096 MiB - 1 B

	882

	883 Instead of having a table in the decoder, the dictionary size

	884 can be decoded using the following C code:

	885

	886 const uint8_t bits = get_dictionary_flags() & 0x3F;

	887 if (bits > 40)

	888 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

	889

	890 uint32_t dictionary_size;

	891 if (bits == 40) {

	892 dictionary_size = UINT32_MAX;

	893 } else {

	894 dictionary_size = 2 | (bits & 1);

	895 dictionary_size <<= bits / 2 + 11;

	896 }
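
The fragment above can be wrapped into a self-contained function, sketched below. The name lzma2_dict_size and its boolean return convention are assumptions made for illustration. The sketch also rejects a Filter Properties byte whose reserved bits are set, as the bit table above requires.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical wrapper around the decoding rule described in this
 * section. Returns false if the reserved bits are set or the raw value
 * is above 40; otherwise stores the dictionary size in *size.
 */
bool
lzma2_dict_size(uint8_t props, uint32_t *size)
{
	if (props & 0xC0)
		return false; /* reserved bits MUST be zero */

	const uint8_t bits = props & 0x3F;
	if (bits > 40)
		return false; /* bigger than 4 GiB */

	if (bits == 40) {
		*size = UINT32_MAX; /* 4 GiB - 1 B */
	} else {
		*size = UINT32_C(2) | (bits & 1); /* mantissa: 2 or 3 */
		*size <<= bits / 2 + 11;          /* exponent: 11 to 30 */
	}

	return true;
}
```

For example, a raw value of 0 yields 4096 bytes (4 KiB), and a raw value of 35 yields 3 << 28 bytes (768 MiB).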

	897

	898

	899 5.3.2. Branch/Call/Jump Filters for Executables

	900

	901 These filters convert relative branch, call, and jump

	902 instructions to their absolute counterparts in executable

	903 files. This conversion increases redundancy and thus

	904 compression ratio.

	905

	906 Size of Filter Properties: 0 or 4 bytes

	907 Changes size of data: No

	908 Allow as a non-last filter: Yes

	909 Allow as the last filter: No

	910

	911 Below is the list of filters in this category. The alignment

	912 is the same for both input and output data.

	913

	914 Filter ID Alignment Description

	915 0x04 1 byte x86 filter (BCJ)

	916 0x05 4 bytes PowerPC (big endian) filter

	917 0x06 16 bytes IA64 filter

	918 0x07 4 bytes ARM (little endian) filter

	919 0x08 2 bytes ARM Thumb (little endian) filter

	920 0x09 4 bytes SPARC filter

	921

	922 If the size of Filter Properties is four bytes, the Filter

	923 Properties field contains the start offset used for address

	924 conversions. It is stored as an unsigned 32-bit little endian

	925 integer. The start offset MUST be a multiple of the alignment

	926 of the filter as listed in the table above; if it isn't, the

	927 decoder MUST indicate an error. If the size of Filter

	928 Properties is zero, the start offset is zero.
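
The rule above could be checked along the following lines. This is a minimal sketch: the function name, its parameters, and the boolean error convention are assumptions. The caller is assumed to pass the non-zero alignment from the table above for the filter in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: parses the Filter Properties of a
 * Branch/Call/Jump filter. Returns false when the decoder must
 * indicate an error. `alignment` must be non-zero.
 */
bool
bcj_start_offset(const uint8_t *props, size_t props_size,
		uint32_t alignment, uint32_t *start_offset)
{
	if (props_size == 0) {
		*start_offset = 0; /* no properties: start offset is zero */
		return true;
	}

	if (props_size != 4)
		return false; /* only 0 or 4 bytes are valid */

	/* Unsigned 32-bit little endian integer. */
	const uint32_t offset = (uint32_t)props[0]
			| ((uint32_t)props[1] << 8)
			| ((uint32_t)props[2] << 16)
			| ((uint32_t)props[3] << 24);

	if (offset % alignment != 0)
		return false; /* not a multiple of the filter's alignment */

	*start_offset = offset;
	return true;
}
```

For instance, the bytes 00 10 00 00 decode to a start offset of 0x1000, which is acceptable for every alignment in the table.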

	929

	930 Setting the start offset may be useful if an executable has

	931 multiple sections, and there are many cross-section calls.

	932 Taking advantage of this feature usually requires usage of

	933 the Subblock filter, whose design is not complete yet.

	934

	935

	936 5.3.3. Delta

	937

	938 The Delta filter may increase compression ratio when the value

	939 of the next byte correlates with the value of an earlier byte

	940 at a specified distance.

	941

	942 Filter ID: 0x03

	943 Size of Filter Properties: 1 byte

	944 Changes size of data: No

	945 Allow as a non-last filter: Yes

	946 Allow as the last filter: No

	947

	948 Preferred alignment:

	949 Input data: 1 byte

	950 Output data: Same as the original input data

	951

	952 The Properties byte indicates the delta distance, which can be

	953 1-256 bytes backwards from the current byte: 0x00 indicates

	954 a distance of 1 byte and 0xFF a distance of 256 bytes.

	955

	956

	957 5.3.3.1. Format of the Encoded Output

	958

	959 The code below illustrates both encoding and decoding with

	960 the Delta filter.

	961

	962 // Distance is in the range [1, 256].

	963 const unsigned int distance = get_properties_byte() + 1;

	964 uint8_t pos = 0;

	965 uint8_t delta[256];

	966

	967 memset(delta, 0, sizeof(delta));

	968

	969 while (1) {

	970 const int byte = read_byte();

	971 if (byte == EOF)

	972 break;

	973

	974 uint8_t tmp = delta[(uint8_t)(distance + pos)];

	975 if (is_encoder) {

	976 tmp = (uint8_t)(byte) - tmp;

	977 delta[pos] = (uint8_t)(byte);

	978 } else {

	979 tmp = (uint8_t)(byte) + tmp;

	980 delta[pos] = tmp;

	981 }

	982

	983 write_byte(tmp);

	984 --pos;

	985 }
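
The loop above depends on read_byte() and write_byte(), which are left undefined. A buffer-based variant of the same logic might look like the sketch below. The function name and signature are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical in-place variant of the Delta loop. `distance` must be
 * in the range [1, 256]. A non-zero `is_encoder` encodes, zero decodes.
 */
void
delta_code(uint8_t *buf, size_t size, unsigned int distance, int is_encoder)
{
	uint8_t pos = 0;
	uint8_t delta[256];

	memset(delta, 0, sizeof(delta));

	for (size_t i = 0; i < size; ++i) {
		/* Original byte seen `distance` bytes earlier, or zero. */
		const uint8_t prev = delta[(uint8_t)(distance + pos)];

		if (is_encoder) {
			const uint8_t orig = buf[i];
			buf[i] = (uint8_t)(orig - prev);
			delta[pos] = orig;
		} else {
			buf[i] = (uint8_t)(buf[i] + prev);
			delta[pos] = buf[i];
		}

		--pos; /* wraps modulo 256 */
	}
}
```

Encoding the bytes 10 20 30 40 50 60 with distance 1 gives 10 10 10 10 10 10, and decoding that output with the same distance restores the original bytes.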

	986

	987

	988 5.4. Custom Filter IDs

	989

	990 If a developer wants to use custom Filter IDs, he has two

	991 choices. The first choice is to contact Lasse Collin and ask

	992 him to allocate a range of IDs for the developer.

	993

	994 The second choice is to generate a 40-bit random integer,

	995 which the developer can use as his personal Developer ID.

	996 To minimize the risk of collisions, the Developer ID has to be

	997 a randomly generated integer, not a manually selected "hex word".

	998 The following command, which works on many free operating

	999 systems, can be used to generate a Developer ID:

	1000

	1001 dd if=/dev/urandom bs=5 count=1 | hexdump

	1002

	1003 The developer can then use his Developer ID to create unique

	1004 (well, hopefully unique) Filter IDs.

	1005

	1006 Bits Mask Description

	1007 0-15 0x0000_0000_0000_FFFF Filter ID

	1008 16-55 0x00FF_FFFF_FFFF_0000 Developer ID

	1009 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F

	1010

	1011 The resulting 63-bit integer will use 9 bytes of space when

	1012 stored using the encoding described in Section 1.2. To get

	1013 a shorter ID, see the beginning of this section for how to

	1014 request a custom ID range.
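
The bit layout in the table above can be sketched in C as follows. Both function names are hypothetical. The size calculation assumes that the encoding of Section 1.2 stores seven bits of the value per byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: builds a custom Filter ID from a 40-bit
 * Developer ID and a 16-bit developer-chosen Filter ID.
 */
uint64_t
custom_filter_id(uint64_t developer_id, uint16_t filter_id)
{
	return (UINT64_C(0x3F) << 56)
			| ((developer_id & UINT64_C(0xFFFFFFFFFF)) << 16)
			| filter_id;
}

/*
 * Number of bytes needed to store `value`, assuming seven bits of
 * payload per encoded byte.
 */
size_t
vli_size(uint64_t value)
{
	size_t size = 1;
	while ((value >>= 7) != 0)
		++size;
	return size;
}
```

With the static prefix 0x3F in place, the highest set bit is bit 61, so the value always needs nine bytes under that assumption, whatever the Developer ID is.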

	1015

	1016

	1017 5.4.1. Reserved Custom Filter ID Ranges

	1018

	1019 Range Description

	1020 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility

	1021 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility

	1022 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility

	1023

	1024

	1025 6. Cyclic Redundancy Checks

	1026

	1027 There are several incompatible variations to calculate CRC32

	1028 and CRC64. For simplicity and clarity, complete examples are

	1029 provided to calculate the checks as they are used in this file

	1030 format. Implementations MAY use different code as long as it

	1031 gives identical results.

	1032

	1033 The program below reads data from standard input, calculates

	1034 the CRC32 and CRC64 values, and prints the calculated values

	1035 as big endian hexadecimal strings to standard output.

	1036

	1037 #include <stddef.h>

	1038 #include <inttypes.h>

	1039 #include <stdio.h>

	1040

	1041 uint32_t crc32_table[256];

	1042 uint64_t crc64_table[256];

	1043

	1044 void

	1045 init(void)

	1046 {

	1047 static const uint32_t poly32 = UINT32_C(0xEDB88320);

	1048 static const uint64_t poly64

	1049 = UINT64_C(0xC96C5795D7870F42);

	1050

	1051 for (size_t i = 0; i < 256; ++i) {

	1052 uint32_t crc32 = i;

	1053 uint64_t crc64 = i;

	1054

	1055 for (size_t j = 0; j < 8; ++j) {

	1056 if (crc32 & 1)

	1057 crc32 = (crc32 >> 1) ^ poly32;

	1058 else

	1059 crc32 >>= 1;

	1060

	1061 if (crc64 & 1)

	1062 crc64 = (crc64 >> 1) ^ poly64;

	1063 else

	1064 crc64 >>= 1;

	1065 }

	1066

	1067 crc32_table[i] = crc32;

	1068 crc64_table[i] = crc64;

	1069 }

	1070 }

	1071

	1072 uint32_t

	1073 crc32(const uint8_t *buf, size_t size, uint32_t crc)

	1074 {

	1075 crc = ~crc;

	1076 for (size_t i = 0; i < size; ++i)

	1077 crc = crc32_table[buf[i] ^ (crc & 0xFF)]

	1078 ^ (crc >> 8);

	1079 return ~crc;

	1080 }

	1081

	1082 uint64_t

	1083 crc64(const uint8_t *buf, size_t size, uint64_t crc)

	1084 {

	1085 crc = ~crc;

	1086 for (size_t i = 0; i < size; ++i)

	1087 crc = crc64_table[buf[i] ^ (crc & 0xFF)]

	1088 ^ (crc >> 8);

	1089 return ~crc;

	1090 }

	1091

	1092 int

	1093 main()

	1094 {

	1095 init();

	1096

	1097 uint32_t value32 = 0;

	1098 uint64_t value64 = 0;

	1099 uint64_t total_size = 0;

	1100 uint8_t buf[8192];

	1101

	1102 while (1) {

	1103 const size_t buf_size

	1104 = fread(buf, 1, sizeof(buf), stdin);

	1105 if (buf_size == 0)

	1106 break;

	1107

	1108 total_size += buf_size;

	1109 value32 = crc32(buf, buf_size, value32);

	1110 value64 = crc64(buf, buf_size, value64);

	1111 }

	1112

	1113 printf("Bytes: %" PRIu64 "\n", total_size);

	1114 printf("CRC-32: 0x%08" PRIX32 "\n", value32);

	1115 printf("CRC-64: 0x%016" PRIX64 "\n", value64);

	1116

	1117 return 0;

	1118 }
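
Since implementations may use different code as long as the results are identical, a table-free variant is sketched below as one possible alternative. The function names are illustrative. It uses the same polynomials, initial inversion, and final inversion as the program above, and processes one bit at a time instead of using lookup tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC32. Pass 0 as `crc` for the first call. */
uint32_t
crc32_bitwise(const uint8_t *buf, size_t size, uint32_t crc)
{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int j = 0; j < 8; ++j)
			crc = (crc >> 1)
				^ ((crc & 1) ? UINT32_C(0xEDB88320) : 0);
	}
	return ~crc;
}

/* Bit-by-bit CRC64. Pass 0 as `crc` for the first call. */
uint64_t
crc64_bitwise(const uint8_t *buf, size_t size, uint64_t crc)
{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i) {
		crc ^= buf[i];
		for (int j = 0; j < 8; ++j)
			crc = (crc >> 1)
				^ ((crc & 1) ? UINT64_C(0xC96C5795D7870F42) : 0);
	}
	return ~crc;
}
```

A quick sanity check for any implementation is the nine-byte ASCII string "123456789": the commonly published check values for these parameters are 0xCBF43926 for CRC32 and 0x995DC9BBDF1939FA for CRC64.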

	1119

	1120

	1121 7. References

	1122

	1123 LZMA SDK - The original LZMA implementation

	1124 http://7-zip.org/sdk.html

	1125

	1126 LZMA Utils - LZMA adapted to POSIX-like systems

	1127 http://tukaani.org/lzma/

	1128

	1129 XZ Utils - The next generation of LZMA Utils

	1130 http://tukaani.org/xz/

	1131

	1132 [RFC-1952]

	1133 GZIP file format specification version 4.3

	1134 http://www.ietf.org/rfc/rfc1952.txt

	1135 - Notation of byte boxes in section "2.1. Overall conventions"

	1136

	1137 [RFC-2119]

	1138 Key words for use in RFCs to Indicate Requirement Levels

	1139 http://www.ietf.org/rfc/rfc2119.txt

	1140

	1141 [GNU-tar]

	1142 GNU tar 1.21 manual

	1143 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html

	1144 - Node 9.4.2 "Blocking Factor", paragraph that begins

	1145 "gzip will complain about trailing garbage"

	1146 - Note that this URL points to the latest version of the

	1147 manual, and may some day not contain the note which is in

	1148 1.21. For the exact version of the manual, download GNU

	1149 tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz

	1150
